00. Linear Algebra Basics
This lecture establishes the rigorous linear algebra foundations required for convex optimization. We begin by defining the core objects—vectors, matrices, and their operations—and progress to the geometry of high-dimensional spaces through inner products, norms, and orthogonality. We then explore the crucial concept of Positive Semidefinite (PSD) matrices, which generalize non-negative numbers to the matrix setting and underpin convexity. Finally, we derive the method of least squares from a geometric perspective via orthogonal projections. This material provides the necessary language to describe convex sets, functions, and optimization problems with precision.
Prerequisites: Basic multivariable calculus and familiarity with standard matrix notation.
Forward Connections: The projection techniques introduced here are essential for the geometric interpretation of constrained optimization. PSD matrices are the cornerstone of convex quadratic programs (QP) and Semidefinite Programming (SDP) covered in Lecture 08. The four fundamental subspaces provide the geometric intuition for duality theory (Lecture 09). Advanced topics such as QR factorization, SVD, and pseudoinverses are covered in Lecture 01.
Learning Objectives
After this lecture, you will be able to:
- Analyze Fundamental Subspaces: Identify and compute the dimensions of the four fundamental subspaces of a matrix ($\mathcal{R}(A), \mathcal{N}(A), \mathcal{R}(A^\top), \mathcal{N}(A^\top)$) using the Rank-Nullity Theorem.
- Manipulate Norms and Inner Products: Apply standard and generalized inner products, vector norms ($\ell_1, \ell_2, \ell_\infty$), and their duals to establish inequalities like Cauchy-Schwarz and Hölder.
- Perform Geometric Operations: Construct orthonormal bases via Gram-Schmidt and compute orthogonal projections onto subspaces and affine sets.
- Characterize PSD Matrices: Determine positive semidefiniteness using eigenvalues, variational forms ($x^\top Ax \ge 0$), and the Schur Complement lemma.
- Solve Least Squares: Derive and solve the normal equations for overdetermined systems and interpret the solution geometrically as an orthogonal projection.
- Apply Matrix Calculus: Compute gradients and Hessians for linear, quadratic, and log-determinant functions, which are essential for first- and second-order optimization algorithms.
1. Mathematical Atoms: Sets, Functions, and Combinations
1.1 The Geometric "Atoms" of Optimization
In convex optimization, we view algebraic problems through a geometric lens. Throughout this course, you will repeatedly perform three fundamental operations. Recognizing these "atoms" allows you to parse complex problems into simple, manageable geometric components:
- Mixing (Convex Combinations): Optimization often involves "averaging" or interpolating between candidate solutions.
Example: In a portfolio optimization problem, if you have two valid asset allocations, any weighted average of them is also a valid allocation. This operation of "mixing" defines the geometry of convex sets.
- Measuring (Norms & Inner Products): We need precise tools to quantify "error" and "similarity".
Example: The objective function often measures the size of the residual vector (how far is our prediction from the data?). Norms act as rulers to measure these lengths, while inner products act as protractors to measure angles (e.g., is the gradient pointing towards the optimum?).
- Pullback (Preimages): Constraints are typically defined by functions, such as $g(x) \le 0$. This defines a set implicitly: the set of all $x$ that map to non-positive values.
Example: Understanding constraints as preimages ($g^{-1}((-\infty, 0])$) allows us to deduce the shape of the feasible region directly from the properties of the function $g$.
Understanding these three operations transforms optimization from abstract symbol manipulation into concrete geometric reasoning.
1.2 Scalars, Vectors, and Matrices
- Scalars: Real numbers $a \in \mathbb{R}$.
- Vectors: A vector $x \in \mathbb{R}^n$ is formally a function $\{1,\dots,n\} \to \mathbb{R}$. We represent it as an ordered $n$-tuple (column vector).
Equality: $x = y \iff x_i = y_i$ for all $i$.
- Matrices: $A \in \mathbb{R}^{m \times n}$.
Matrix-Vector Product: The definition $(Ax)_i = \sum_{j=1}^n A_{ij} x_j$ is chosen specifically to enforce linearity: $A(\alpha x + \beta y) = \alpha Ax + \beta Ay$.
Column View: $Ax = \sum_{j=1}^n x_j A_{:j}$. $Ax$ is a linear combination of the columns of $A$. This is the "synthesis" view: $A$ builds a new vector from its columns.
Row View: $(Ax)_i = A_{i:}^\top x$. The $i$-th entry is the dot product of the $i$-th row with $x$. This is the "analysis" view: each row "measures" the input vector $x$.
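The column ("synthesis") and row ("analysis") views can be verified numerically. A minimal NumPy sketch (the matrix and vector here are arbitrary illustrations):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # A in R^{3x2}
x = np.array([10.0, -1.0])

# Column view: Ax is a linear combination of A's columns.
col_view = x[0] * A[:, 0] + x[1] * A[:, 1]

# Row view: each entry of Ax is a row of A dotted with x.
row_view = np.array([A[i, :] @ x for i in range(A.shape[0])])

assert np.allclose(A @ x, col_view)
assert np.allclose(A @ x, row_view)
```

Both views compute the same vector; they differ only in which loop order you emphasize.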
1.3 Preliminaries: Lines and Line Segments
We work in the vector space $\mathbb{R}^n$ equipped with the standard inner product $\langle x, y \rangle = x^\top y = \sum_{i=1}^n x_i y_i$ and the induced Euclidean norm $\|x\|_2 = \sqrt{\langle x, x \rangle}$. Geometric objects like lines and segments are the building blocks of convexity.
Line through two points
Given two distinct points $x_1, x_2 \in \mathbb{R}^n$, the line passing through them is the set of all affine combinations of these points:
$$ y = \theta x_1 + (1-\theta)x_2 = x_2 + \theta(x_1 - x_2), \quad \theta \in \mathbb{R} $$
As $\theta$ varies over $\mathbb{R}$, $y$ traces out the infinite line passing through $x_2$ in the direction of $x_1 - x_2$. The value $\theta=0$ corresponds to $x_2$, and $\theta=1$ corresponds to $x_1$.
Line segment between two points
The line segment connecting $x_1$ and $x_2$, denoted $[x_1, x_2]$, corresponds to restricting the parameter $\theta$ to the unit interval $[0, 1]$:
$$ y = \theta x_1 + (1-\theta)x_2, \quad 0 \le \theta \le 1 $$
This parametrization is the definition of a convex combination of two points. Intuitively, a convex combination is a "weighted average" that lies physically between the points. If you imagine a mass at $x_1$ and a mass at $x_2$, the point $y$ is the center of mass when the weights are $\theta$ and $1-\theta$. A set is defined to be convex if and only if it contains the entire line segment between any pair of its points.
1.4 Functions and Preimages
A function $f: \mathcal{X} \to \mathcal{Y}$ maps domain $\mathcal{X}$ to codomain $\mathcal{Y}$. The range (or image) is the set of actual outputs: $\mathrm{range}(f) = \{f(x) : x \in \mathcal{X}\} \subseteq \mathcal{Y}$.
The Preimage Laws
For a set $B \subseteq \mathcal{Y}$, the preimage is $f^{-1}(B) = \{x \in \mathcal{X} : f(x) \in B\}$. This exists even if $f$ is not invertible. Preimages respect all set operations perfectly: $f^{-1}(B_1 \cup B_2) = f^{-1}(B_1) \cup f^{-1}(B_2)$, $f^{-1}(B_1 \cap B_2) = f^{-1}(B_1) \cap f^{-1}(B_2)$, and $f^{-1}(B^c) = (f^{-1}(B))^c$.
Proof: Preimage Laws
Proof: $x \in f^{-1}(B^c) \iff f(x) \notin B \iff x \notin f^{-1}(B) \iff x \in (f^{-1}(B))^c$.
Why this matters: Constraints like $g(x) \le 0$ define the set $g^{-1}((-\infty, 0])$. If $g$ is convex, the preimage of the convex set $(-\infty, 0]$ is convex.
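The preimage view can be seen concretely by sampling. The constraint function $g(x) = x^2 - 1$ below is a hypothetical example of a convex $g$; its sublevel set $g^{-1}((-\infty, 0])$ is the interval $[-1, 1]$, and the sampled feasible set comes out as one gap-free interval:

```python
import numpy as np

# g is convex, so the sublevel set g^{-1}((-inf, 0]) should be convex.
g = lambda x: x**2 - 1.0          # hypothetical example constraint

xs = np.linspace(-3, 3, 601)      # grid with step 0.01
feasible = xs[g(xs) <= 0]         # sampled preimage of (-inf, 0]

# For a convex set in 1-D, the feasible samples form a single interval.
assert np.isclose(feasible.min(), -1.0)
assert np.isclose(feasible.max(), 1.0)
# No gaps: consecutive feasible samples are adjacent grid points.
assert np.allclose(np.diff(feasible), xs[1] - xs[0])
```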
1.5 Combinations and Hulls
Different rules for "mixing" vectors generate different geometric objects.
| Type | Coefficients $\theta_i$ | Generated Object |
|---|---|---|
| Linear | Arbitrary | Subspace (Span) |
| Affine | $\sum \theta_i = 1$ | Affine Set (Line, Plane) |
| Convex | $\sum \theta_i = 1, \theta_i \ge 0$ | Convex Set (Polygon, Solid) |
Translation Invariance of Affine Combinations
Affine combinations are "origin-independent". If $y = \sum \theta_i x_i$ with $\sum \theta_i = 1$, then shifting all points by $v$ shifts the result by $v$:
$$ \sum \theta_i (x_i + v) = \sum \theta_i x_i + \left(\sum \theta_i\right) v = y + v $$
The 2-Point Convex Hull is a Segment
The set of convex combinations of $x$ and $y$ is $S = \{\theta x + (1-\theta)y : \theta \in [0, 1]\}$.
This matches the geometric definition of the segment $[x, y]$: start at $y$, move towards $x$ by fraction $\theta$.
$$ \theta x + (1-\theta)y = y + \theta(x-y) $$
A set is convex if it contains the line segment between any pair of its points.
2. Subspaces and the Four Fundamental Spaces
A linear subspace is a set of vectors that is closed under addition and scalar multiplication. Given a matrix $A \in \mathbb{R}^{m \times n}$, we can define four fundamental subspaces that characterize the behavior of the linear map $T(x) = Ax$.
- Column Space (Range): $\mathcal{R}(A) = \{Ax \mid x \in \mathbb{R}^n\} \subseteq \mathbb{R}^m$. This is the set of all possible outputs of the linear map, i.e., the span of the columns of $A$.
- Nullspace (Kernel): $\mathcal{N}(A) = \{x \in \mathbb{R}^n \mid Ax = 0\}$. This is the set of all input vectors that are mapped to the zero vector.
- Row Space: $\mathcal{R}(A^\top) \subseteq \mathbb{R}^n$. This is the span of the rows of $A$.
- Left Nullspace: $\mathcal{N}(A^\top) = \{y \in \mathbb{R}^m \mid A^\top y = 0\}$. This is the nullspace of the transpose of $A$.
Why do we care? The Left Nullspace defines the constraints on solvability. The system $Ax=b$ has a solution if and only if $b$ is orthogonal to every vector in the Left Nullspace ($b \perp \mathcal{N}(A^\top)$). This is the Fredholm Alternative.
These subspaces are linked by the Fundamental Theorem of Linear Algebra. This theorem is not just an algebraic curiosity; it is the geometric foundation for understanding when linear systems are solvable (feasibility) and when they are unique.
- Relationship 1: $\mathcal{R}(A) \perp \mathcal{N}(A^\top)$ in $\mathbb{R}^m$. The range is the orthogonal complement of the left nullspace.
- Relationship 2: $\mathcal{R}(A^\top) \perp \mathcal{N}(A)$ in $\mathbb{R}^n$. The row space is the orthogonal complement of the nullspace.
Proof: $\mathcal{R}(A) \perp \mathcal{N}(A^\top)$
Take any $Ax \in \mathcal{R}(A)$ and any $z \in \mathcal{N}(A^\top)$. Then $\langle Ax, z \rangle = x^\top A^\top z = x^\top 0 = 0$, so every vector in the range is orthogonal to every vector in the left nullspace.
The Fredholm Alternative (Theorem of Alternatives)
A crucial consequence of the orthogonality $\mathcal{R}(A) = \mathcal{N}(A^\top)^\perp$ is a solvability criterion for linear systems, known as the Fredholm Alternative. For any $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, exactly one of the following two statements is true:
- Primal Feasibility: The system $Ax = b$ has a solution.
- Dual Certificate of Infeasibility: There exists a vector $y$ such that $A^\top y = 0$ but $b^\top y \neq 0$.
Interpretation: If $Ax=b$ has no solution, then $b$ does not lie in $\mathcal{R}(A)$. Since $\mathcal{R}(A)^\perp = \mathcal{N}(A^\top)$, there must be a component of $b$ that lies in $\mathcal{N}(A^\top)$. This component $y$ serves as a witness (or certificate) proving that $b$ is unreachable by $A$. This logic generalizes to Farkas' Lemma and strong duality in later lectures.
Geometric Example: $\mathbb{R}^3$
Let $A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}$.
Column Space $\mathcal{R}(A)$: The span of $(1,0,0)^\top$ and $(0,1,0)^\top$, i.e., the $xy$-plane ($z=0$).
Left Nullspace $\mathcal{N}(A^\top)$: We solve $A^\top y = 0$.
$$ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \implies y_1 = 0, y_2 = 0 $$
So $y = (0, 0, y_3)^\top$. This is the $z$-axis.
Orthogonality: The $xy$-plane is clearly orthogonal to the $z$-axis.
Rank and Rank–Nullity Theorem
$\mathrm{rank}(A) = \dim \mathcal{R}(A)$. For $A \in \mathbb{R}^{m \times n}$: $$ \dim \mathcal{N}(A) + \mathrm{rank}(A) = n $$ This fundamental result relates the dimensions of the domain's subspaces to the rank of the mapping.
Proof of the Rank-Nullity Theorem
Let $\{v_1, \dots, v_k\}$ be a basis of $\mathcal{N}(A)$, and extend it to a basis $\{v_1, \dots, v_k, w_1, \dots, w_r\}$ of $\mathbb{R}^n$, so that $k + r = n$. Since $Av_j = 0$, the vectors $\{Aw_1, \dots, Aw_r\}$ span $\mathcal{R}(A)$. If $\sum_j \alpha_j A w_j = 0$, then $\sum_j \alpha_j w_j \in \mathcal{N}(A)$, which contradicts the independence of the extended basis unless $\alpha = 0$. Since the only solution to $\sum_j \alpha_j A w_j = 0$ is the trivial solution $\alpha = 0$, the set $\{Aw_1, \dots, Aw_r\}$ is linearly independent, hence a basis of $\mathcal{R}(A)$. Therefore $\mathrm{rank}(A) = r = n - \dim \mathcal{N}(A)$.
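The subspace relationships and the Rank-Nullity count can be sanity-checked numerically via the SVD. A NumPy sketch, reusing the $\mathbb{R}^3$ example matrix from above (the second matrix $B$ is an illustrative rank-1 example):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])          # the example map R^2 -> R^3
rank = np.linalg.matrix_rank(A)     # = 2

# Left nullspace N(A^T): columns of U beyond the rank in the full SVD.
U, s, Vt = np.linalg.svd(A)
left_null = U[:, rank:]             # spans the z-axis here
assert np.allclose(A.T @ left_null, 0)

# Fredholm alternative: Ax = b is solvable iff b is orthogonal to N(A^T).
b_good = np.array([2.0, 3.0, 0.0])  # lies in the xy-plane = R(A)
b_bad  = np.array([2.0, 3.0, 1.0])  # has a z-component
assert np.allclose(left_null.T @ b_good, 0)
assert not np.allclose(left_null.T @ b_bad, 0)

# Rank-Nullity on a matrix with a nontrivial nullspace (rank 1).
B = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
rB = np.linalg.matrix_rank(B)
_, _, Vt_B = np.linalg.svd(B)
nullspace = Vt_B[rB:, :].T          # rows of V^T beyond the rank span N(B)
assert np.allclose(B @ nullspace, 0)
assert rB + nullspace.shape[1] == B.shape[1]   # rank + nullity = n
```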
3. Algebraic Invariants: Determinant, Trace, and Eigenvalues
While a matrix $A$ represents a linear map in a specific basis, we are often interested in properties that are intrinsic to the map itself, independent of the basis. These are called invariants.
3.1 The "Big Three" Invariants
For a square matrix $A \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_1, \dots, \lambda_n$ (counted with algebraic multiplicity):
- Trace: $\mathrm{tr}(A) = \sum_{i=1}^n A_{ii} = \sum_{i=1}^n \lambda_i$.
- Determinant: $\det(A) = \prod_{i=1}^n \lambda_i$.
- Eigenvalues: Roots of the characteristic polynomial $p_A(\lambda) = \det(A - \lambda I)$.
Geometric Interpretation
- Determinant as Volume: $|\det(A)|$ is the factor by which the linear map scales volume. If $S$ is a set with volume $V$, then $A(S)$ has volume $|\det(A)| V$. The sign indicates orientation (preservation vs. reversal).
- Trace as Derivative: The trace is the "infinitesimal change in volume." Specifically, $\det(I + \epsilon A) \approx 1 + \epsilon \mathrm{tr}(A)$ for small $\epsilon$. This connects the trace to the derivative of the determinant, a fact used in the matrix calculus derivations.
- Eigenvalues as Stretch Factors: If $Ax = \lambda x$, the map scales the vector $x$ by $\lambda$. A symmetric matrix, which has a full set of orthonormal eigenvectors, maps the unit circle to an ellipse whose semiaxes are aligned with the eigenvectors and have lengths $|\lambda_i|$.
3.2 The "Isotropic Scaling" Lemma
While eigenvalues describe the behavior of $A$ along specific directions, sometimes the matrix acts uniformly in all directions. A classic result connects this geometric isotropy to the algebraic structure of the matrix. We ask: what if every vector is an eigenvector?
Lemma: Scalar Matrices
If every non-zero vector $x \in \mathbb{R}^n$ is an eigenvector of $A$ (i.e., $Ax = \lambda(x)x$), then $A$ must be a scalar multiple of the identity: $A = \lambda I$.
Proof of the Isotropic Scaling Lemma
Step 1 (independent vectors share eigenvalues): Suppose $u$ and $v$ are linearly independent, with $Au = \lambda_u u$ and $Av = \lambda_v v$. Applying the hypothesis to $u+v$ gives $A(u+v) = \lambda_{u+v}(u+v)$, i.e., $\lambda_u u + \lambda_v v = \lambda_{u+v} u + \lambda_{u+v} v$. Linear independence forces $\lambda_u = \lambda_{u+v} = \lambda_v$.
Step 2 (dependent vectors): Now suppose $v$ is a non-zero multiple of $u$. However, assuming the dimension $n \ge 2$, the space is not just a line. We can choose a third vector $z$ that is linearly independent of $u$. Since $v$ is a multiple of $u$, $z$ is also linearly independent of $v$.
Applying the result from Step 1 to the pair $(u, z)$, we get $\lambda_u = \lambda_z$.
Applying it to the pair $(v, z)$, we get $\lambda_v = \lambda_z$.
By transitivity, $\lambda_u = \lambda_v$.
Geometric Intuition: A general symmetric matrix maps the unit sphere to an ellipsoid. The eigenvectors correspond to the axes of the ellipsoid. If every vector is an eigenvector, then every direction is a principal axis. The only ellipsoid with this symmetry is a sphere (isotropic scaling).
3.3 Algebraic Properties
Linearity of Trace
The trace is a linear functional: $$ \mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B) \quad \text{and} \quad \mathrm{tr}(cA) = c\,\mathrm{tr}(A). $$
Proof
By definition, $\mathrm{tr}(A) = \sum_{i=1}^n a_{ii}$. Then:
$$ \mathrm{tr}(A+B) = \sum_{i=1}^n (a_{ii} + b_{ii}) = \sum_{i=1}^n a_{ii} + \sum_{i=1}^n b_{ii} = \mathrm{tr}(A) + \mathrm{tr}(B) $$
Multiplicativity of Determinant
The determinant is multiplicative: $$ \det(AB) = \det(A)\det(B). $$ This reflects the geometric fact that the volume scaling of a composite map is the product of the individual scalings. Note that $\det(AB) = \det(BA)$ even if $AB \neq BA$.
Trace of a Product (Cyclic Property)
In general, $\mathrm{tr}(AB) \neq \mathrm{tr}(A)\mathrm{tr}(B)$. However, the trace is cyclic: $$ \mathrm{tr}(AB) = \mathrm{tr}(BA). $$ More generally, $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$. This property is crucial for deriving matrix gradients.
Proof: $\mathrm{tr}(AB) = \mathrm{tr}(BA)$
$$ \mathrm{tr}(AB) = \sum_{i} (AB)_{ii} = \sum_{i} \sum_{j} A_{ij} B_{ji} = \sum_{j} \sum_{i} B_{ji} A_{ij} = \sum_{j} (BA)_{jj} = \mathrm{tr}(BA) $$
Worked Example: Trace and Determinant Caveats
While $\mathrm{tr}(A+B) = \mathrm{tr}(A) + \mathrm{tr}(B)$, other properties are subtler.
1. Trace of Product: $\mathrm{tr}(AB) \neq \mathrm{tr}(A)\mathrm{tr}(B)$.
Let $A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$ and $B = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}$.
$\mathrm{tr}(A)=1, \mathrm{tr}(B)=1 \implies \mathrm{tr}(A)\mathrm{tr}(B)=1$.
But $AB = 0$, so $\mathrm{tr}(AB) = 0$. Thus multiplicativity fails.
2. Determinant of Sum: $\det(A+B) \neq \det(A) + \det(B)$.
A nice conceptual counterexample: Take $A = I$ and $B = -I$ (for even $n$).
$\det(A) = 1$, $\det(B) = (-1)^n = 1$.
$\det(A+B) = \det(0) = 0$.
But $\det(A) + \det(B) = 1 + 1 = 2 \neq 0$.
The failure of additivity is generic; $\det$ is multiplicative, not additive.
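Both caveats, and the cyclic property that does hold, can be checked numerically. A NumPy sketch using the counterexamples above (the random matrices are arbitrary illustrations):

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 0.0]])
B = np.array([[0.0, 0.0], [0.0, 1.0]])

# tr(AB) != tr(A) tr(B): here AB = 0 but tr(A) tr(B) = 1.
assert np.isclose(np.trace(A @ B), 0.0)
assert np.isclose(np.trace(A) * np.trace(B), 1.0)

# ...but the trace IS cyclic, even for rectangular factors:
X = np.random.default_rng(0).normal(size=(3, 4))
Y = np.random.default_rng(1).normal(size=(4, 3))
assert np.isclose(np.trace(X @ Y), np.trace(Y @ X))

# det(A+B) != det(A) + det(B): with A = I, B = -I (n = 2 even),
# det(A) + det(B) = 2 but det(A+B) = det(0) = 0.
I2 = np.eye(2)
assert np.isclose(np.linalg.det(I2) + np.linalg.det(-I2), 2.0)
assert np.isclose(np.linalg.det(I2 + (-I2)), 0.0)
```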
3.4 Similarity and Basis Independence
Two matrices $A$ and $B$ are similar if $B = P A P^{-1}$ for some invertible $P$. This represents the same linear operator in a different basis. The detailed mechanics of this coordinate transformation and its connection to diagonalization are covered in depth in Lecture 01.
Theorem: Invariance under Similarity
If $A$ and $B$ are similar, they share the same algebraic invariants:
- $\det(PAP^{-1}) = \det(P)\det(A)\det(P^{-1}) = \det(A)$.
- $\mathrm{tr}(PAP^{-1}) = \mathrm{tr}(AP^{-1}P) = \mathrm{tr}(A)$.
- $p_{PAP^{-1}}(\lambda) = \det(PAP^{-1} - \lambda I) = \det(P(A-\lambda I)P^{-1}) = p_A(\lambda)$.
Thus, they have the same eigenvalues.
3.5 Spectral Shift
Adding a multiple of the identity shifts the eigenvalues but preserves eigenvectors.
If $Ax = \lambda x$, then:
$$ (A + tI)x = Ax + tIx = (\lambda + t)x. $$
Thus, the eigenvalues of $A+tI$ are $\{\lambda_i + t\}$.
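A quick numerical check of the spectral shift, sketched in NumPy with an arbitrary symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                     # symmetrize so eigenvalues are real
t = 2.5

eig_A = np.sort(np.linalg.eigvalsh(A))
eig_shift = np.sort(np.linalg.eigvalsh(A + t * np.eye(4)))

# Eigenvalues shift by exactly t; eigenvectors are unchanged.
assert np.allclose(eig_shift, eig_A + t)
```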
Warning: In general, $\det(A+B) \neq \det(A) + \det(B)$. For example, $\det(I+I) = 2^n \neq 1+1$.
4. Inner Products, Norms, and Angles
Inner Products
An inner product on $\mathbb{R}^n$ is a function $\langle \cdot, \cdot \rangle: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ satisfying three axioms for all $x, y, z$ and scalars $\alpha, \beta$:
- Linearity: $\langle \alpha x + \beta z, y \rangle = \alpha\langle x, y \rangle + \beta\langle z, y \rangle$.
- Symmetry: $\langle x, y \rangle = \langle y, x \rangle$.
- Positive Definiteness: $\langle x, x \rangle \ge 0$, and $\langle x, x \rangle = 0 \iff x = 0$.
These axioms ensure that "lengths" and "angles" behave as expected. The standard (Euclidean) inner product is defined as $\langle x, y \rangle = x^\top y = \sum_{i=1}^n x_i y_i$.
Proof: Dot Product is an Inner Product
Linearity and symmetry follow directly from the formula $\sum_i x_i y_i$. For positive definiteness, $\langle x, x \rangle = \sum_i x_i^2 \ge 0$, with equality only if every $x_i = 0$, i.e., $x = 0$.
Weighted Inner Products
For any symmetric positive definite matrix $P \in \mathbb{S}^n_{++}$, we can define a weighted inner product: $$ \langle x, y \rangle_P := x^\top P y $$ This induces the P-norm $\|x\|_P = \sqrt{x^\top P x}$, which measures distance in a skewed coordinate system (related to ellipsoids).
Norms and Convexity
A norm is a function $\|\cdot\|: \mathbb{R}^n \to \mathbb{R}$ assigning a length to each vector. Axioms:
- Definiteness: $\|x\| \ge 0$, and $\|x\| = 0 \iff x=0$.
- Homogeneity: $\|\alpha x\| = |\alpha| \|x\|$.
- Triangle Inequality: $\|x+y\| \le \|x\| + \|y\|$.
The $\ell_1$ and $\ell_\infty$ Norms (Rigorous Proofs)
- $\ell_1$ (Manhattan): $\|x\|_1 = \sum_{i=1}^n |x_i|$.
- $\ell_\infty$ (Chebyshev): $\|x\|_\infty = \max_{i} |x_i|$.
Proof: Triangle Inequalities for $\ell_1$ and $\ell_\infty$
For $\ell_1$: $\|x+y\|_1 = \sum_i |x_i + y_i| \le \sum_i (|x_i| + |y_i|) = \|x\|_1 + \|y\|_1$. For $\ell_\infty$: for every $i$, $|x_i + y_i| \le |x_i| + |y_i| \le \|x\|_\infty + \|y\|_\infty$; taking the maximum over $i$ gives the result.
Unit Balls are Convex
A crucial geometric property: The unit ball $B = \{x : \|x\| \le 1\}$ of any norm is a convex set.
Proof: Unit Ball Convexity
Let $x, y \in B$, so $\|x\| \le 1, \|y\| \le 1$. For any $\theta \in [0, 1]$:
$$ \|\theta x + (1-\theta)y\| \le \|\theta x\| + \|(1-\theta)y\| = \theta\|x\| + (1-\theta)\|y\| \le \theta(1) + (1-\theta)(1) = 1 $$
Thus $\theta x + (1-\theta)y \in B$. This means norm constraints $f(x) = \|Ax-b\| \le t$ define convex feasible sets.
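The triangle inequality and the unit-ball convexity argument can be stress-tested on random vectors. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
for _ in range(1000):
    x, y = rng.normal(size=(2, 5))
    for p in (1, 2, np.inf):
        # Triangle inequality for the l1, l2, and l_inf norms.
        lhs = np.linalg.norm(x + y, p)
        rhs = np.linalg.norm(x, p) + np.linalg.norm(y, p)
        assert lhs <= rhs + 1e-12

    # Unit-ball convexity: normalize into the l2 ball, mix, stay inside.
    u, v = x / np.linalg.norm(x), y / np.linalg.norm(y)
    theta = rng.uniform()
    assert np.linalg.norm(theta * u + (1 - theta) * v) <= 1 + 1e-12
```

Random testing does not prove the inequalities, but it catches sign or exponent errors in a proposed "norm" immediately.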
The Euclidean Geometry ($\ell_2$)
The Euclidean norm $\|x\|_2 = \sqrt{x^\top x}$ is special because it arises from an inner product.
Foundational Scalar Lemmas
- 1D Triangle Inequality: $|a+b| \le |a| + |b|$.
Proof: Squaring both sides (non-negative) gives $(a+b)^2 \le (|a|+|b|)^2 \iff a^2+2ab+b^2 \le a^2+2|a||b|+b^2 \iff ab \le |ab|$, which is always true.
- Quadratic Non-negativity: If $q(t) = at^2 + bt + c \ge 0$ for all $t$ (with $a > 0$), then the discriminant $b^2 - 4ac \le 0$.
Reason: If the discriminant were positive, the quadratic would have two real roots and dip below zero between them.
Proof: Cauchy-Schwarz Inequality ($|x^\top y| \le \|x\|_2 \|y\|_2$)
If $y = 0$ the claim is trivial. Otherwise consider $q(t) = \|x + ty\|_2^2 = \|y\|_2^2 t^2 + 2(x^\top y)t + \|x\|_2^2 \ge 0$ for all $t$. By the quadratic non-negativity lemma, the discriminant satisfies $4(x^\top y)^2 - 4\|x\|_2^2\|y\|_2^2 \le 0$, i.e., $|x^\top y| \le \|x\|_2 \|y\|_2$.
Proof: Triangle Inequality for $\ell_2$
Using Cauchy-Schwarz:
$$ \|x+y\|_2^2 = \|x\|_2^2 + 2\langle x, y \rangle + \|y\|_2^2 \le \|x\|_2^2 + 2\|x\|_2\|y\|_2 + \|y\|_2^2 = (\|x\|_2 + \|y\|_2)^2 $$
Taking square roots gives $\|x+y\|_2 \le \|x\|_2 + \|y\|_2$.
Dual Norms and Polar Sets
The dual norm provides a variational definition of size. For any norm $\|\cdot\|$, its dual $\|\cdot\|_*$ is: $$ \|y\|_* = \sup_{\|x\| \le 1} x^\top y $$ This measures the maximum "correlation" $y$ can have with any unit vector.
Theorem: The Dual Norm is a Norm
Proof sketch: Homogeneity: $\|\alpha y\|_* = \sup_{\|x\| \le 1} \alpha x^\top y = |\alpha|\,\|y\|_*$ (replace $x$ by $-x$ when $\alpha < 0$). Triangle inequality: the supremum of a sum is at most the sum of the suprema, so $\|y+z\|_* \le \|y\|_* + \|z\|_*$. Definiteness: if $y \neq 0$, taking $x = y/\|y\|$ gives $\|y\|_* \ge y^\top y / \|y\| > 0$.
Generalized Hölder's Inequality
From the definition, we immediately get the fundamental inequality: $$ x^\top y \le \|x\| \|y\|_* $$ Proof: If $x=0$, trivial. If $x \neq 0$, let $u = x/\|x\|$. Then $\|u\|=1$, so $u^\top y \le \|y\|_* \implies (x/\|x\|)^\top y \le \|y\|_* \implies x^\top y \le \|x\| \|y\|_*$.
Canonical Dual Pairs
- $\ell_2$ is self-dual: $\|y\|_{2,*} = \|y\|_2$. (Aligned by $x = y/\|y\|_2$).
- $\ell_1$ dual is $\ell_\infty$: $\|y\|_{1,*} = \|y\|_\infty$. (Maximize $x^\top y$ on diamond $\to$ pick corner).
- $\ell_\infty$ dual is $\ell_1$: $\|y\|_{\infty,*} = \|y\|_1$. (Maximize $x^\top y$ on box $\to$ match signs).
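The $\ell_1$/$\ell_\infty$ duality can be verified by brute force: a linear functional over a convex ball attains its supremum at a vertex, so it suffices to enumerate the vertices of each ball. A NumPy sketch (the vector $y$ is an arbitrary illustration):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)
y = rng.normal(size=4)

# Dual of l1: maximize x^T y over the vertices +/- e_i of the l1 ball
# (a cross-polytope / "diamond"). The max picks out the largest |y_i|.
vertices_l1 = np.vstack([np.eye(4), -np.eye(4)])
assert np.isclose(max(v @ y for v in vertices_l1), np.linalg.norm(y, np.inf))

# Dual of l_inf: maximize x^T y over the sign-vector vertices of the box.
# Matching signs recovers sum |y_i|.
vertices_box = np.array(list(product([-1.0, 1.0], repeat=4)))
assert np.isclose(max(v @ y for v in vertices_box), np.linalg.norm(y, 1))
```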
Lemma: Conjugate of a Norm
This lemma is crucial for dual derivations. Let $\|\cdot\|$ be any norm and $\|\cdot\|_*$ its dual norm. The conjugate of the scaled norm $f(x) = \rho \|x\|$ is the indicator function of the dual ball: $$ (\rho\|x\|)^*(z) = I_{\{\|z\|_* \le \rho\}}(z) = \begin{cases} 0 & \|z\|_* \le \rho \\ \infty & \text{otherwise} \end{cases} $$ Proof sketch: If $\|z\|_* \le \rho$, then $z^\top x - \rho\|x\| \le \|z\|_*\|x\| - \rho\|x\| \le 0$ (max is 0 at $x=0$). If $\|z\|_* > \rho$, we can align $x$ with $z$ to make the expression go to $+\infty$.
Polar Sets
The polar of a convex set $C$ is $C^\circ = \{y : y^\top x \le 1 \ \forall x \in C\}$.
For a unit ball $B = \{x : \|x\| \le 1\}$, the polar is exactly the dual unit ball:
$$ B^\circ = \{y : \|y\|_* \le 1\} $$
This geometric duality underpins the duality of optimization problems.
Spectral Norm and Convexity
Induced Matrix Norms
The spectral norm is a specific instance of an induced norm (or operator norm). For any vector norms $\|\cdot\|_a$ on the domain and $\|\cdot\|_b$ on the codomain, the induced norm of a matrix $X$ is: $$ \|X\|_{a,b} = \sup_{v \neq 0} \frac{\|Xv\|_b}{\|v\|_a} = \sup_{\|v\|_a = 1} \|Xv\|_b $$ This measures the maximum amplification of the $b$-norm of the output relative to the $a$-norm of the input.
Common Operator Norms
When the domain and codomain are equipped with the same $\ell_p$ norm ($a=b=p$), we write $\|X\|_p$.
- Spectral Norm ($p=2$): $\|X\|_2 = \sigma_{\max}(X)$. (Max stretch of Euclidean length).
- Max Column Sum Norm ($p=1$): The operator norm induced by the $\ell_1$ norm is the maximum absolute column sum. $$ \|X\|_1 = \max_{j} \sum_{i} |X_{ij}| $$ Intuition: The unit ball of $\ell_1$ is a diamond with vertices at $e_j$. The maximum stretch occurs at one of these vertices $X e_j$ (the $j$-th column).
- Max Row Sum Norm ($p=\infty$): The operator norm induced by the $\ell_\infty$ norm is the maximum absolute row sum. $$ \|X\|_\infty = \max_{i} \sum_{j} |X_{ij}| $$ Intuition: This is the dual of the 1-norm case.
Convexity: All induced norms are convex functions of the matrix $X$. This follows because $\|X\|_{a,b} = \sup_{\|u\|_{b^*} \le 1, \|v\|_a \le 1} u^\top X v$, which is a pointwise supremum of linear functions.
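The closed forms above can be checked against NumPy's matrix norms, and the supremum definition can be probed by Monte-Carlo sampling (which gives a lower bound that never exceeds the true norm). A sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 5))

# Closed forms for the induced p-norms:
assert np.isclose(np.linalg.norm(X, 1), np.abs(X).sum(axis=0).max())       # max column sum
assert np.isclose(np.linalg.norm(X, np.inf), np.abs(X).sum(axis=1).max())  # max row sum
assert np.isclose(np.linalg.norm(X, 2),
                  np.linalg.svd(X, compute_uv=False)[0])                   # sigma_max

# Random directions give ratios ||Xv|| / ||v|| that never exceed ||X||_2.
V = rng.normal(size=(5, 2000))
ratios = np.linalg.norm(X @ V, axis=0) / np.linalg.norm(V, axis=0)
assert ratios.max() <= np.linalg.norm(X, 2) + 1e-12
```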
Spectral Norm Details
The spectral norm (or operator norm) of a matrix $X \in \mathbb{R}^{m \times n}$ is induced by the Euclidean vector norm: $$ \|X\|_2 = \sup_{v \neq 0} \frac{\|Xv\|_2}{\|v\|_2} = \sup_{\|v\|_2=1} \|Xv\|_2 $$
Intuition: Worst-Case Stretch. The operator norm measures the maximum factor by which the matrix $X$ can "stretch" a vector. It corresponds to the "Lipschitz constant" of the linear map. Geometrically, it is the length of the longest semi-axis of the ellipsoid formed by mapping the unit sphere.
It turns out this equals the maximum singular value: $$ \|X\|_2 = \sigma_{\max}(X) = \sqrt{\lambda_{\max}(X^\top X)} $$ Since $X^\top X$ is always Positive Semidefinite ($v^\top X^\top X v = \|Xv\|_2^2 \ge 0$), its eigenvalues are non-negative.
Proof: $\|X\|_2 = \sqrt{\lambda_{\max}(X^\top X)}$
For $\|v\|_2 = 1$, $\|Xv\|_2^2 = v^\top (X^\top X) v$. Since $X^\top X$ is symmetric PSD, write $X^\top X = Q \Lambda Q^\top$ and substitute $w = Q^\top v$ (also a unit vector, since $Q$ is orthogonal): $v^\top X^\top X v = \sum_i \lambda_i w_i^2 \le \lambda_{\max}$, with equality when $v$ is the eigenvector for $\lambda_{\max}$. Hence $\sup_{\|v\|_2=1} \|Xv\|_2 = \sqrt{\lambda_{\max}(X^\top X)}$.
Dual Representation (Crucial for Convexity Proofs): Similar to the max eigenvalue, the spectral norm can be written as a supremum: $$ \|X\|_2 = \sup_{\|u\|_2=1, \|v\|_2=1} u^\top X v $$ Since $f(X) = u^\top X v$ is a linear function of $X$, and $\|X\|_2$ is the supremum of these linear functions, $\|X\|_2$ is a convex function.
Matrix Inner Product and Frobenius Norm
For matrices $X, Y \in \mathbb{R}^{m \times n}$, the trace inner product (also called the Hilbert-Schmidt inner product) is: $$ \langle X, Y \rangle = \mathrm{tr}(X^\top Y) = \sum_{i=1}^m \sum_{j=1}^n X_{ij} Y_{ij} $$ The Frobenius norm (or Hilbert-Schmidt norm) is $\|X\|_F = \sqrt{\langle X, X \rangle}$.
Derivation: Trace and Frobenius Norm
We rigorously show that the Frobenius norm $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ is induced by the trace inner product $\langle A, B \rangle = \mathrm{tr}(A^\top B)$.
Submultiplicativity of the Frobenius Norm
For compatible matrices $A, B$, we have the inequality $\|AB\|_F \le \|A\|_F \|B\|_F$. This property is crucial for analyzing the convergence of matrix iterations.
Proof via Column/Row Factorization
We prove this inequality by decomposing the matrix multiplication into vector inner products and applying Cauchy-Schwarz. Let $C = AB$.
Denote the $i$-th column of $A^\top$ (which is the $i$-th row of $A$) as $A_i \in \mathbb{R}^n$.
Denote the $j$-th column of $B$ as $B^j \in \mathbb{R}^n$.
The $(i,j)$ entry of $C$ is the inner product $C_{ij} = \langle A_i, B^j \rangle$, so by Cauchy-Schwarz $C_{ij}^2 \le \|A_i\|_2^2 \|B^j\|_2^2$. Summing over all $i, j$:
$$ \|AB\|_F^2 = \sum_{i,j} C_{ij}^2 \le \left(\sum_i \|A_i\|_2^2\right)\left(\sum_j \|B^j\|_2^2\right) = \|A\|_F^2 \|B\|_F^2 $$
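A quick numerical check of the trace inner product and submultiplicativity, sketched in NumPy with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 6))

# The entrywise formula matches NumPy's Frobenius norm...
fro = lambda M: np.sqrt((M ** 2).sum())
assert np.isclose(fro(A), np.linalg.norm(A, 'fro'))

# ...and is induced by the trace inner product <A, A> = tr(A^T A).
assert np.isclose(np.trace(A.T @ A), fro(A) ** 2)

# Submultiplicativity: ||AB||_F <= ||A||_F ||B||_F.
assert fro(A @ B) <= fro(A) * fro(B) + 1e-12
```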
5. Orthogonality and Orthonormal Bases
Orthonormal Sets
A set of vectors $q_1, \dots, q_k$ is orthonormal if its elements are mutually orthogonal and each has a norm of 1. Formally, $q_i^\top q_j = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta (1 if $i=j$, 0 otherwise). A square matrix $Q$ with orthonormal columns is an orthogonal matrix, satisfying the important property $Q^\top Q = I$, which means $Q^{-1} = Q^\top$. Orthonormal bases are computationally desirable because they are numerically stable and simplify many calculations, such as projections.
Angle Preservation and Orthogonal Matrices
A linear map $A$ preserves angles if $\frac{\langle Ax, Ay \rangle}{\|Ax\|\|Ay\|} = \frac{\langle x, y \rangle}{\|x\|\|y\|}$ for all $x, y$.
A stronger condition is that $A$ preserves inner products: $\langle Ax, Ay \rangle = \langle x, y \rangle$.
Using the adjoint property, $\langle x, A^\top A y \rangle = \langle x, y \rangle$ for all $x, y$, which implies $A^\top A = I$.
Thus, matrices that preserve inner products (and hence lengths and angles) are exactly the orthogonal matrices; maps that preserve angles alone are scalar multiples of orthogonal matrices.
The Orthogonal Group $O_n$
The set of all $n \times n$ orthogonal matrices forms a group, denoted $O_n$.
- Closure: If $Q_1, Q_2 \in O_n$, then $(Q_1 Q_2)^\top (Q_1 Q_2) = Q_2^\top Q_1^\top Q_1 Q_2 = Q_2^\top I Q_2 = I$.
- Inverse: $Q^{-1} = Q^\top$, which is also orthogonal.
- Compactness: $O_n$ is a compact set in $\mathbb{R}^{n \times n}$. It is closed (preimage of $I$ under continuous map $f(A)=A^\top A$) and bounded (columns have unit norm).
The Gram-Schmidt Process
The Gram-Schmidt process is an algorithm for constructing an orthonormal basis from a set of linearly independent vectors. Starting with a vector, it iteratively subtracts the components that lie in the direction of the previously processed vectors, leaving a new, orthogonal vector that is then normalized. $$ \tilde{q}_k = a_k - \sum_{i=1}^{k-1} (q_i^\top a_k) q_i, \quad q_k = \frac{\tilde{q}_k}{\|\tilde{q}_k\|_2} $$
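The update rule above translates directly into code. A minimal NumPy sketch of classical Gram-Schmidt (the input matrix is an arbitrary illustration; the function assumes linearly independent columns):

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt: orthonormalize linearly independent columns."""
    Q = []
    for a in vectors.T:                # iterate over columns of `vectors`
        q = a.astype(float)
        for qi in Q:
            q = q - (qi @ a) * qi      # subtract the component along q_i
        q = q / np.linalg.norm(q)      # normalize (assumes independence)
        Q.append(q)
    return np.column_stack(Q)

A = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
Q = gram_schmidt(A)
assert np.allclose(Q.T @ Q, np.eye(2))   # columns are orthonormal
assert np.allclose(Q @ (Q.T @ A), A)     # span of the columns is preserved
```

In floating point, the classical variant can lose orthogonality on ill-conditioned inputs; modified Gram-Schmidt (subtracting projections from the running $q$ rather than the original $a$) is more stable.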
The QR Decomposition
The QR decomposition expresses a matrix $A$ as the product of an orthonormal matrix $Q$ and an upper triangular matrix $R$. This decomposition is a direct outcome of the Gram-Schmidt process and is fundamental to numerical linear algebra. For a matrix $A \in \mathbb{R}^{m \times n}$ with full column rank, we can write $A = QR$, where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper-triangular.
Applications:
- Solving Linear Systems: The system $Ax=b$ becomes $QRx=b$, which simplifies to $Rx = Q^\top b$. This is easily solved using back substitution and is far more numerically stable than forming the normal equations.
- Projections: The projection onto the column space of $A$ is given by $P = QQ^\top$.
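Both applications can be demonstrated in a few lines. A NumPy sketch with an arbitrary overdetermined system:

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.normal(size=(50, 3))          # overdetermined, full column rank
b = rng.normal(size=50)

# Least squares via QR: solve R x = Q^T b (R is triangular, so this is cheap).
Q, R = np.linalg.qr(A)                # thin QR: Q is 50x3, R is 3x3
x_qr = np.linalg.solve(R, Q.T @ b)

# Agrees with the normal equations A^T A x = A^T b.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_qr, x_ne)

# The projection onto R(A) is P = Q Q^T: idempotent and fixes R(A).
P = Q @ Q.T
assert np.allclose(P @ A, A)          # columns of A are already in R(A)
assert np.allclose(P @ P, P)          # projecting twice changes nothing
```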
6. Positive Semidefinite Matrices
6.1 The Space of Symmetric Matrices ($\mathbb{S}^n$)
In convex optimization, a central object is the Hessian matrix $\nabla^2 f(x)$. By Schwarz's Theorem, if a function is twice continuously differentiable, its Hessian is symmetric. A matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A = A^\top$. The set of such matrices is denoted $\mathbb{S}^n$.
The Spectral Theorem
If $A \in \mathbb{S}^n$, then:
- All eigenvalues ($\lambda_1, \dots, \lambda_n$) are real numbers.
- There exists a set of orthonormal eigenvectors $q_1, \dots, q_n$.
- $A$ can be diagonalized as $A = Q \Lambda Q^\top$, where $Q$ is an orthogonal matrix ($Q^\top Q = I$) containing the eigenvectors, and $\Lambda$ is a diagonal matrix containing the eigenvalues.
Why are Eigenvectors of Symmetric Matrices Orthogonal?
The Spectral Theorem guarantees orthogonal eigenvectors for symmetric matrices. The proof relies on the adjoint property.
Proof of Orthogonality
Let $A$ be symmetric ($A^\top = A$). Let $v_1, v_2$ be eigenvectors with distinct eigenvalues $\lambda_1 \neq \lambda_2$. $$ A v_1 = \lambda_1 v_1, \quad A v_2 = \lambda_2 v_2 $$ Consider the inner product $\langle A v_1, v_2 \rangle$. We can evaluate it two ways:
- $\langle A v_1, v_2 \rangle = \langle \lambda_1 v_1, v_2 \rangle = \lambda_1 \langle v_1, v_2 \rangle$
- $\langle A v_1, v_2 \rangle = \langle v_1, A^\top v_2 \rangle = \langle v_1, A v_2 \rangle = \langle v_1, \lambda_2 v_2 \rangle = \lambda_2 \langle v_1, v_2 \rangle$
Subtracting gives $(\lambda_1 - \lambda_2) \langle v_1, v_2 \rangle = 0$. Since $\lambda_1 \neq \lambda_2$, we must have $\langle v_1, v_2 \rangle = 0$.
Implication for convexity: The "shape" of a multivariable function (bowl vs. saddle) is entirely determined by the signs of the eigenvalues of its Hessian.
6.2 Positive Semidefinite (PSD) Matrices
This definition is central to the course. It provides the matrix equivalent of a "non-negative number" and underpins the definition of Convex Functions (Lecture 05) and Semidefinite Programming (Lecture 08).
Forward Connection: The PSD Cone
The set of all PSD matrices forms a convex cone, denoted $\mathbb{S}^n_+$. This geometric object is the foundation of Semidefinite Programming (SDP), a powerful generalization of linear programming that we will study in Lecture 08. As LP optimizes over the non-negative orthant, SDP optimizes over the PSD cone.
Definition A: The Variational Definition
A symmetric matrix $A \in \mathbb{S}^n$ is Positive Semidefinite (PSD), denoted as $A \succeq 0$, if:
$$ x^\top A x \ge 0 \quad \text{for all } x \in \mathbb{R}^n $$
Geometric Intuition (Generalized Non-Negative Numbers): Just as a non-negative number $a \ge 0$ allows us to define convex quadratic functions like $f(x) = ax^2$, a PSD matrix allows us to define convex quadratic forms in higher dimensions. If $A \succeq 0$, the function $f(x) = x^\top A x$ looks like a bowl. If $A$ has a negative eigenvalue, the function looks like a saddle (curves down in some directions), breaking convexity.
Curvature: If $A$ is the Hessian $\nabla^2 f(x)$, the term $x^\top A x$ is the second derivative along the line defined by vector $x$. $A \succeq 0$ means the function has non-negative curvature in every possible direction.
Definition B: The Eigenvalue Definition
A symmetric matrix $A \in \mathbb{S}^n$ is PSD if and only if all its eigenvalues are non-negative:
$$ \lambda_i(A) \ge 0 \quad \text{for all } i = 1, \dots, n $$
Factorization of PSD Matrices (Gram Matrix)
A symmetric matrix $A$ is PSD if and only if it can be written as a Gram matrix: $A = B^\top B$ for some matrix $B$ (not necessarily square).
Proof: If $A \succeq 0$, spectral decomposition gives $A = Q \Lambda Q^\top = (Q \Lambda^{1/2}) (Q \Lambda^{1/2})^\top$. Let $B = (Q \Lambda^{1/2})^\top$. This $B$ is a "square root" of $A$.
Conversely, if $A = B^\top B$, then $x^\top A x = x^\top B^\top B x = (Bx)^\top (Bx) = \|Bx\|_2^2 \ge 0$.
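As a sanity check, the Gram factorization and both definitions of positive semidefiniteness can be verified numerically. This is a minimal NumPy sketch; the matrix sizes and random seed are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric PSD matrix A = G^T G (a Gram matrix by construction).
G = rng.standard_normal((5, 4))
A = G.T @ G

# Spectral decomposition A = Q diag(lam) Q^T (eigh handles symmetric matrices).
lam, Q = np.linalg.eigh(A)
assert np.all(lam >= -1e-10)  # eigenvalue definition of PSD

# "Square root" factor B = (Q Lambda^{1/2})^T, so that A = B^T B.
B = (Q * np.sqrt(np.clip(lam, 0, None))).T
assert np.allclose(B.T @ B, A)

# Variational definition: x^T A x = ||Bx||_2^2 >= 0 for any x.
x = rng.standard_normal(4)
assert x @ A @ x >= 0
```

Note that `Q * np.sqrt(...)` scales each column of $Q$ by the square root of the corresponding eigenvalue, which is exactly the product $Q\Lambda^{1/2}$.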
Proof: Equivalence of Definitions
We prove that $(x^\top A x \ge 0 \text{ for all } x) \iff (\lambda_i \ge 0 \text{ for all } i)$. By the Spectral Theorem, write $A = Q \Lambda Q^\top$ with orthonormal eigenvectors $q_1, \dots, q_n$, so that for any $x$: $$ x^\top A x = \sum_{i=1}^n \lambda_i (q_i^\top x)^2 $$ ($\Leftarrow$) If every $\lambda_i \ge 0$, each term in the sum is non-negative, hence $x^\top A x \ge 0$. ($\Rightarrow$) Conversely, choosing $x = q_j$ gives $x^\top A x = \lambda_j$, so every eigenvalue must be non-negative.
6.3 Geometry of PSD Matrices: Ellipsoids
Positive Definite (PD) matrices define the shape of ellipsoids. For a matrix $P \in \mathbb{S}^n_{++}$, the set: $$ \mathcal{E} = \{ x \in \mathbb{R}^n \mid x^\top P x \le 1 \} $$ is an ellipsoid centered at the origin.
- Axes alignment: The axes of the ellipsoid are aligned with the eigenvectors of $P$.
- Axis lengths: The semi-axis lengths are $1/\sqrt{\lambda_i}$, where $\lambda_i$ are the eigenvalues of $P$. Note the inverse relationship: a large eigenvalue means steep curvature in the quadratic bowl, which corresponds to a short axis in the level set ellipsoid (the function rises quickly).
This geometric link explains why "conditioning" matters. If the ratio $\lambda_{\max}/\lambda_{\min}$ is large, the ellipsoid is very long and thin (a deep valley in one direction, flat in another), making optimization difficult.
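The eigenvalue-to-axis relationship can be checked on a small example. The sketch below (an illustration with an arbitrary diagonal matrix) confirms that points at distance $1/\sqrt{\lambda_i}$ along each eigenvector lie exactly on the ellipsoid boundary $x^\top P x = 1$:

```python
import numpy as np

# PD matrix with eigenvalues 4 and 1: the level set x^T P x = 1 is an
# ellipse with semi-axes 1/sqrt(4) = 0.5 and 1/sqrt(1) = 1.0.
P = np.diag([4.0, 1.0])
lam, V = np.linalg.eigh(P)
semi_axes = 1.0 / np.sqrt(lam)

# Boundary points along the eigenvector directions satisfy x^T P x = 1.
for length, v in zip(semi_axes, V.T):
    x = length * v
    assert np.isclose(x @ P @ x, 1.0)

# Condition number lam_max / lam_min = 4: a mildly elongated ellipse.
cond = lam.max() / lam.min()
```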
6.4 The Rayleigh Quotient and Eigenvalues
The maximum eigenvalue $\lambda_{\max}(A)$ can be defined as an optimization problem:
$$ \lambda_{\max}(A) = \sup_{x \neq 0} \frac{x^\top A x}{x^\top x} = \sup_{\|x\|_2 = 1} x^\top A x $$

This is the Rayleigh Quotient. This definition proves that $\lambda_{\max}(A)$ is a convex function of $A$. Notice that for a fixed $x$, the function $g(A) = x^\top A x$ is linear in $A$. Since $\lambda_{\max}(A)$ is the pointwise supremum of a family of linear functions (indexed by $x$), it is a convex function.
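The variational characterization can be probed numerically: random unit vectors never exceed $\lambda_{\max}$, and the top eigenvector attains it. A minimal NumPy sketch (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2  # symmetrize

lam, Q = np.linalg.eigh(A)  # eigenvalues in ascending order
lam_max = lam[-1]

# Random unit vectors never beat the top eigenvalue ...
X = rng.standard_normal((1000, 6))
X /= np.linalg.norm(X, axis=1, keepdims=True)
rayleigh = np.einsum('ij,jk,ik->i', X, A, X)  # x_i^T A x_i for each row
assert rayleigh.max() <= lam_max + 1e-10

# ... and the top eigenvector attains it exactly.
q = Q[:, -1]
assert np.isclose(q @ A @ q, lam_max)
```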
6.5 The Schur Complement Lemma
A powerful tool for handling block matrices in optimization is the Schur Complement. It allows us to convert complicated nonlinear constraints into Linear Matrix Inequalities (LMIs) and provides a systematic way to check positive semidefiniteness.
💡 Intuition: Completing the Square
The Schur Complement is the matrix generalization of "completing the square". Consider a scalar quadratic form in two variables $x, y$:
$$ q(x, y) = ax^2 + 2bxy + cy^2 $$

To determine if this is non-negative for all $x, y$, we complete the square with respect to $x$ (assuming $a > 0$):

$$ a\left(x + \frac{b}{a}y\right)^2 + \left(c - \frac{b^2}{a}\right)y^2 $$

The first term is always non-negative. The sign of the entire expression depends on the second term coefficient: $S = c - b^2/a$.
In the matrix case $\begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}$, the term $S = C - B^\top A^{-1} B$ plays exactly the same role. The block-triangular factorization $ME$ derived below is the matrix equivalent of this algebraic rearrangement.
Why is this useful?
Often in optimization we encounter nonlinear inequalities like $\frac{x^2}{y} \le t$ or $\|Ax\|_2^2 \le c$. These look difficult to handle with linear algebra. The Schur Complement allows us to rewrite them as linear constraints on a matrix variable. For example, $x^2 \le ty$ (with $y>0$) is equivalent to the matrix condition $\begin{bmatrix} y & x \\ x & t \end{bmatrix} \succeq 0$. This "lifting" trick is the key to Semidefinite Programming.
1. Block Matrix Setup and Dimensions
Consider a block matrix $M$ partitioned as follows:
$$ M = \begin{bmatrix} A & B \\ C & D \end{bmatrix} $$

where $A \in \mathbb{R}^{p \times p}$, $B \in \mathbb{R}^{p \times q}$, $C \in \mathbb{R}^{q \times p}$, and $D \in \mathbb{R}^{q \times q}$. Assume the bottom-right block $D$ is invertible.
The Schur complement of $D$ in $M$ is the matrix defined as:
$$ S := A - B D^{-1} C $$

2. Constructing the Elimination Matrix
To understand the origin of $S$, we apply Block Gaussian Elimination. The goal is to zero out the $C$ block (bottom-left) to obtain a block upper-triangular matrix, from which properties like determinant and invertibility are easily read.
We achieve this by post-multiplying $M$ by an elimination matrix $E$:
$$ E = \begin{bmatrix} I_p & 0 \\ -D^{-1}C & I_q \end{bmatrix} $$

This matrix $E$ is block lower-triangular with identity matrices on the diagonal, so $\det(E) = 1$. It acts as a "column operation": it subtracts $D^{-1}C$ times the second column-block from the first column-block.
Multiplying $M$ by $E$ yields the factorization:
$$ ME = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} I_p & 0 \\ -D^{-1}C & I_q \end{bmatrix} = \begin{bmatrix} A - BD^{-1}C & B \\ 0 & D \end{bmatrix} = \begin{bmatrix} S & B \\ 0 & D \end{bmatrix} $$

Line-by-Line Verification of Block Multiplication
We verify this multiplication explicitly using the standard block multiplication rule $\begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} X & Y \\ Z & W \end{bmatrix} = \begin{bmatrix} AX+BZ & AY+BW \\ CX+DZ & CY+DW \end{bmatrix}$.
Here we set $X=I_p, Y=0, Z=-D^{-1}C, W=I_q$.
- Top-left (The Pivot): $A(I_p) + B(-D^{-1}C) = A - BD^{-1}C = S$. This term is exactly the "pivot" remaining after elimination.
- Top-right: $A(0) + B(I_q) = B$. (Unchanged).
- Bottom-left (The Zero): $C(I_p) + D(-D^{-1}C) = C - C = 0$. (This was the specific design goal of the elimination step).
- Bottom-right: $C(0) + D(I_q) = D$. (Unchanged).
💡 Choice of Elimination Matrix
The goal is to eliminate the bottom-left block $C$, analogous to Gaussian elimination. We require the $(2,1)$ block of the product to be zero: $$ C \cdot I_p + D \cdot Z = 0 \implies D Z = -C \implies Z = -D^{-1}C $$ This determines the entry $-D^{-1}C$ in the elimination matrix $E$.
Since $ME$ is block upper-triangular, its determinant is the product of the determinants of its diagonal blocks: $\det(ME) = \det(S)\det(D)$. Because $\det(E) = 1$, we derive the Determinant Identity:
$$ \det(M) = \det(D) \det(A - B D^{-1} C) $$

Numerical Example (Sanity Check)
Let $A=[2], B=[1], C=[3], D=[4]$. Then $M = \begin{bmatrix} 2 & 1 \\ 3 & 4 \end{bmatrix}$.
- Direct Determinant: $\det(M) = 2(4) - 1(3) = 8 - 3 = 5$.
- Schur Complement: $S = A - B D^{-1} C = 2 - 1(\frac{1}{4})3 = 2 - 0.75 = 1.25$.
- Formula Check: $\det(D)\det(S) = 4(1.25) = 5$.
It matches perfectly.
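The same identity can be checked on random block matrices of any size. A minimal NumPy sketch (block sizes and seed are arbitrary; $D$ is shifted to keep it safely invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 3, 2
A = rng.standard_normal((p, p))
B = rng.standard_normal((p, q))
C = rng.standard_normal((q, p))
D = rng.standard_normal((q, q)) + 5 * np.eye(q)  # keep D invertible

M = np.block([[A, B], [C, D]])
S = A - B @ np.linalg.solve(D, C)  # Schur complement of D: A - B D^{-1} C

# Determinant identity: det(M) = det(D) det(S)
assert np.isclose(np.linalg.det(M), np.linalg.det(D) * np.linalg.det(S))
```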
3. Symmetric Variant: Schur Complement of $A$
Similarly, if $A$ is invertible, we can instead eliminate the off-diagonal block using $A$ as the pivot (pre-multiplying by the analogous elimination matrix). This produces the Schur complement of $A$, defined as $D - C A^{-1} B$, and the determinant identity becomes:

$$ \det(M) = \det(A) \det(D - C A^{-1} B) $$

4. Schur Complement and Positive Semidefiniteness
This is the crucial property for convex optimization. Let $M$ be a symmetric matrix ($C = B^\top$):
$$ M = \begin{bmatrix} A & B \\ B^\top & D \end{bmatrix} $$

Theorem: If $D \succ 0$, then $M \succeq 0$ if and only if the Schur complement $S = A - B D^{-1} B^\top \succeq 0$.
Let $z = \begin{bmatrix} x \\ y \end{bmatrix}$. The quadratic form is: $$ z^\top M z = x^\top A x + 2 x^\top B y + y^\top D y $$ We want to choose $y$ to minimize this expression for a fixed $x$, effectively "concentrating" the condition onto $x$. Setting the gradient with respect to $y$ to zero: $$ \nabla_y (z^\top M z) = 2B^\top x + 2Dy = 0 \implies y = -D^{-1}B^\top x $$ Now, we substitute this optimal $y$ back into the quadratic form: $$ z^\top M z = x^\top A x + 2 x^\top B (-D^{-1}B^\top x) + (-D^{-1}B^\top x)^\top D (-D^{-1}B^\top x) $$ $$ = x^\top A x - 2 x^\top B D^{-1} B^\top x + x^\top B D^{-1} D D^{-1} B^\top x $$ $$ = x^\top A x - 2 x^\top B D^{-1} B^\top x + x^\top B D^{-1} B^\top x $$ $$ = x^\top (A - B D^{-1} B^\top) x = x^\top S x $$ ($\Rightarrow$) If $M \succeq 0$, then $z^\top M z \ge 0$ holds in particular for this choice of $y$, so $x^\top S x \ge 0$ for all $x$, which means $S \succeq 0$. ($\Leftarrow$) Conversely, completing the square shows that for arbitrary $y$: $$ z^\top M z = x^\top S x + (y + D^{-1}B^\top x)^\top D \,(y + D^{-1}B^\top x) $$ If $S \succeq 0$ and $D \succ 0$, both terms are non-negative, so $M \succeq 0$.
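The equivalence between $M \succeq 0$ and $S \succeq 0$ (with $D \succ 0$) can be verified numerically by constructing one example of each kind. A minimal NumPy sketch; the construction of `A_psd` and `A_bad` is an arbitrary illustration chosen so the Schur complement is $\pm 0.5 I$:

```python
import numpy as np

def is_psd(X, tol=1e-9):
    return np.all(np.linalg.eigvalsh(X) >= -tol)

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 2))
D = 2.0 * np.eye(2)                                      # D > 0
A_psd = B @ np.linalg.solve(D, B.T) + 0.5 * np.eye(3)    # gives S = +0.5 I
A_bad = B @ np.linalg.solve(D, B.T) - 0.5 * np.eye(3)    # gives S = -0.5 I

for A, expect in [(A_psd, True), (A_bad, False)]:
    M = np.block([[A, B], [B.T, D]])
    S = A - B @ np.linalg.solve(D, B.T)   # Schur complement of D
    # M is PSD exactly when S is PSD (since D > 0).
    assert is_psd(M) == is_psd(S) == expect
```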
Example: Determinant of Symmetric Block Matrix
Consider the matrix $M = \begin{bmatrix} Y & x \\ x^\top & t \end{bmatrix}$ where $Y \succ 0$. Using the general determinant formula with $A=Y, B=x, C=x^\top, D=t$: $$ \det M = \det Y \cdot \det(t - x^\top Y^{-1} x) = (\det Y)(t - x^\top Y^{-1} x) $$ This scalar formula is frequently used to check if $M \succ 0$: we need $\det Y > 0$ (given) and the Schur complement $t - x^\top Y^{-1} x > 0$.
Forward Connection: This result can also be viewed as minimizing the quadratic form $f(x, y) = [x^\top \ y^\top] M [x^\top \ y^\top]^\top$ with respect to $y$ (partial minimization), yielding a new quadratic form in $x$ defined by the Schur complement matrix. This perspective is explored in Lecture 05.
The Triangle of Equivalence
For a positive definite matrix $Y \succ 0$, the following three conditions are equivalent. They represent three different "faces" of the same object:
$$ \boxed{ \begin{matrix} \textbf{Scalar Inequality} & & \textbf{Block Matrix LMI} \\ t \ge x^\top Y^{-1} x & \iff & \begin{bmatrix} Y & x \\ x^\top & t \end{bmatrix} \succeq 0 \\ & \Updownarrow & \\ & \textbf{Rank-1 Update} & \\ & tY - xx^\top \succeq 0 & \end{matrix} } $$

- Scalar: Useful for epigraphs of quadratic-over-linear functions.
- Block Matrix: Useful for Semidefinite Programming (SDP) constraints.
- Rank-1 Update: Useful for algebraic manipulation ($tY \succeq xx^\top$).
Detailed Derivation: From Rank-1 to Scalar
We want to show $tY - xx^\top \succeq 0 \iff t \ge x^\top Y^{-1} x$. ($\Rightarrow$) Test with the specific vector $v = Y^{-1}x$: $$ v^\top(tY - xx^\top)v = t\, x^\top Y^{-1} x - (x^\top Y^{-1} x)^2 \ge 0 $$ If $x \neq 0$, dividing by $x^\top Y^{-1} x > 0$ gives $t \ge x^\top Y^{-1} x$; if $x = 0$, the claim reduces to $t \ge 0$, which follows from $tY \succeq 0$ and $Y \succ 0$. ($\Leftarrow$) For any $v$, apply Cauchy-Schwarz in the inner product $\langle u, w \rangle_Y = u^\top Y w$ with $u = Y^{-1}x$: $$ (x^\top v)^2 = \langle Y^{-1}x, v \rangle_Y^2 \le (x^\top Y^{-1} x)(v^\top Y v) \le t\, v^\top Y v $$ which rearranges to $v^\top (tY - xx^\top) v \ge 0$ for all $v$.
Application: This lemma turns the nonlinear constraint $\|Bx\|_2^2 \le c$ (equivalent to $x^\top B^\top B x \le c$) into an LMI: $$ \begin{bmatrix} I & Bx \\ (Bx)^\top & c \end{bmatrix} \succeq 0 $$ This is central to Semidefinite Programming.
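The lifting trick can be confirmed numerically: the block matrix is PSD precisely when the nonlinear inequality holds. A minimal NumPy sketch (matrix size, seed, and the $\pm 1$ margins are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
lhs = np.linalg.norm(B @ x) ** 2   # ||Bx||_2^2

for c, expect in [(lhs + 1.0, True), (lhs - 1.0, False)]:
    # LMI: [[I, Bx], [(Bx)^T, c]] >= 0  <=>  ||Bx||_2^2 <= c
    Bx = (B @ x).reshape(-1, 1)
    M = np.block([[np.eye(3), Bx], [Bx.T, np.array([[c]])]])
    psd = np.all(np.linalg.eigvalsh(M) >= -1e-9)
    assert psd == expect
```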
A positive definite matrix $Q$ can be used to define a generalized norm, $\|x\|_Q = \sqrt{x^\top Q x}$. The unit ball for this norm, $\{x \mid x^\top Q x \le 1\}$, is an ellipsoid, whose geometry is determined by the eigenvalues and eigenvectors of $Q$.
6.6 The Loewner Order
For symmetric matrices $X, Y \in \mathbb{S}^n$, we write $X \succeq Y$ if and only if $X - Y$ is positive semidefinite (PSD), meaning:
$$ v^\top (X - Y) v \ge 0 \quad \text{for all } v \in \mathbb{R}^n $$Key Properties of the Loewner Order
- Eigenvalue characterization: $X \succeq 0$ if and only if all eigenvalues of $X$ are nonnegative.
- Inner product property: If $X \succeq 0$ and $Y \succeq 0$, then: $$ \langle X, Y \rangle = \mathrm{tr}(XY) \ge 0 $$ (Proof sketch: write $X = X^{1/2} X^{1/2}$; by cyclicity of the trace, $\mathrm{tr}(XY) = \mathrm{tr}(X^{1/2} Y X^{1/2})$, and $X^{1/2} Y X^{1/2} \succeq 0$, so its trace is non-negative.)
- Partial order: The relation $\succeq$ is reflexive, transitive, and antisymmetric (hence a partial order).
The Loewner order is the language of:
- Ellipsoids (defined by PSD matrices in quadratic forms)
- PSD cone $\mathbb{S}^n_+ = \{X \in \mathbb{S}^n \mid X \succeq 0\}$ in Lecture 03
- Semidefinite programs (SDPs) in Lecture 08
7. Projections onto Subspaces and Affine Sets
The concept of "finding the closest point" is the geometric heart of optimization. Whether solving least squares, applying constraints, or analyzing duality, we are often just computing an orthogonal projection.
7.1 Projection onto a Line (1D Case)
Before tackling general subspaces, let's rigorously derive the projection of a vector $x$ onto the line spanned by a single non-zero vector $a \in \mathbb{R}^n$. This simple case contains all the intuition needed for the general theory.
We seek the scalar $\alpha$ that minimizes the squared distance $\|x - \alpha a\|_2^2$. $$ f(\alpha) = \|x - \alpha a\|_2^2 = (x - \alpha a)^\top (x - \alpha a) = \|x\|_2^2 - 2\alpha (a^\top x) + \alpha^2 \|a\|_2^2 $$ This is a simple convex quadratic in $\alpha$. Setting the derivative to zero: $$ f'(\alpha) = -2(a^\top x) + 2\alpha \|a\|_2^2 = 0 \implies \alpha^\star = \frac{a^\top x}{\|a\|_2^2} $$ The projection is the vector $p = \alpha^\star a$: $$ p = \frac{a^\top x}{a^\top a} a = \left( \frac{a a^\top}{a^\top a} \right) x $$
Geometric Insight: Orthogonality of the Residual
Let $r = x - p$ be the error (residual) vector. Check its inner product with $a$: $$ a^\top r = a^\top \left( x - \frac{a^\top x}{a^\top a} a \right) = a^\top x - \frac{a^\top x}{a^\top a} (a^\top a) = a^\top x - a^\top x = 0 $$ The error is orthogonal to the subspace (line). This Orthogonality Principle is universal: the optimal error is always perpendicular to the feasible set (for subspaces and affine sets).
7.2 Orthogonal Projection onto a General Subspace
Let $\mathcal{S} \subseteq \mathbb{R}^m$ be a subspace and $b \in \mathbb{R}^m$. The orthogonal projection $p \in \mathcal{S}$ of $b$ is the unique vector in $\mathcal{S}$ minimizing $\|b - p\|_2$. It is characterized by the condition that the residual is orthogonal to the entire subspace: $$ b - p \perp \mathcal{S} \quad \iff \quad v^\top(b - p) = 0 \ \ \forall v \in \mathcal{S} $$
Projection via a Basis
If $Q \in \mathbb{R}^{m \times k}$ has orthonormal columns spanning $\mathcal{S}$, the projector is: $$ P = QQ^\top \quad \text{and} \quad p = Pb = QQ^\top b $$ Check: $P^2 = P$ (idempotent) and $P^\top = P$ (symmetric)—that's what makes it an orthogonal projector.
Projection onto $\mathcal{R}(A)$ using $A$
If $A \in \mathbb{R}^{m \times n}$ has full column rank: $$ P = A(A^\top A)^{-1} A^\top \quad \text{and} \quad p = Pb $$
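Both projector formulas, and the defining idempotent/symmetric properties, are easy to check numerically. A minimal NumPy sketch (a random tall matrix is full column rank with probability one; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 3))       # generically full column rank
b = rng.standard_normal(6)

# Projector onto R(A) and the projection of b.
P = A @ np.linalg.solve(A.T @ A, A.T)
p = P @ b

assert np.allclose(P @ P, P)          # idempotent
assert np.allclose(P, P.T)            # symmetric
assert np.allclose(A.T @ (b - p), 0)  # residual orthogonal to R(A)

# Same projector from an orthonormal basis (thin QR): P = Q Q^T.
Q, _ = np.linalg.qr(A)
assert np.allclose(Q @ Q.T, P)
```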
Projection onto an Affine Set
Consider an affine set defined by $\mathcal{A} = \{ x \in \mathbb{R}^n \mid Fx = g \}$, where $F \in \mathbb{R}^{p \times n}$ has full row rank and $g \in \mathbb{R}^p$. Projecting onto this set reduces to subspace projection by translating. This concept is generalized in Lecture 01 to projection onto convex sets (POCS).
Explicit Formula via Least Norm
We solve the problem $\min \|x - y\|_2^2$ s.t. $Fx = g$. The KKT conditions yield the linear system: $$ \begin{bmatrix} I & F^\top \\ F & 0 \end{bmatrix} \begin{bmatrix} x \\ \nu \end{bmatrix} = \begin{bmatrix} y \\ g \end{bmatrix} $$ Solving for $x$ gives the explicit projection formula: $$ \Pi_{\mathcal{A}}(y) = y - F^\top (F F^\top)^{-1} (F y - g) $$ This is often more practical than constructing a nullspace basis $Z$.
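The explicit affine-projection formula can be validated by checking feasibility and the orthogonality of the step to the nullspace of $F$. A minimal NumPy sketch (random data; the nullspace basis is read off the SVD):

```python
import numpy as np

rng = np.random.default_rng(6)
F = rng.standard_normal((2, 5))   # generically full row rank
g = rng.standard_normal(2)
y = rng.standard_normal(5)

# Pi_A(y) = y - F^T (F F^T)^{-1} (F y - g)
x = y - F.T @ np.linalg.solve(F @ F.T, F @ y - g)

assert np.allclose(F @ x, g)      # feasibility: x lies in the affine set

# Optimality: the step y - x lies in R(F^T), hence is orthogonal to N(F).
_, _, Vt = np.linalg.svd(F)
Z = Vt[2:].T                      # columns span N(F) (last right singular vectors)
assert np.allclose(Z.T @ (y - x), 0)
```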
Parametric Representation
Every point in $\mathcal{A}$ can be written as $x = x_0 + Zt$, where:
- $x_0$ is any particular solution satisfying $Fx_0 = g$.
- $Z$ is a matrix whose columns form a basis for $\mathcal{N}(F)$ (the nullspace of $F$).
- $t \in \mathbb{R}^k$ where $k = n - p$ (dimension of the nullspace).
Euclidean Projection Formula (Explicit via KKT)
Alternatively, using Lagrange multipliers (a technique we will explore deeply in Lecture 09), we can derive the projection without forming a nullspace basis.

$$ \Pi_{\mathcal{A}}(y) = y - F^\top (F F^\top)^{-1} (F y - g) $$

Derivation via Lagrange Multipliers

Form the Lagrangian $L(x, \nu) = \|x - y\|_2^2 + \nu^\top (Fx - g)$. Stationarity in $x$ gives $2(x - y) + F^\top \nu = 0$, so $x = y - \tfrac{1}{2} F^\top \nu$. Enforcing feasibility $Fx = g$ yields $F y - \tfrac{1}{2} F F^\top \nu = g$, hence $\nu = 2 (F F^\top)^{-1} (F y - g)$ (note $F F^\top$ is invertible because $F$ has full row rank). Substituting back gives the formula above.
Euclidean Projection Formula (Nullspace)
The Euclidean projection of any point $y \in \mathbb{R}^n$ onto $\mathcal{A}$ can also be computed via the nullspace basis:
$$ \Pi_{\mathcal{A}}(y) = x_0 + Z(Z^\top Z)^{-1} Z^\top (y - x_0) $$

Geometric Interpretation and Derivation
This formula says: "Project $(y - x_0)$ onto the nullspace $\mathcal{N}(F)$, then shift back by $x_0$."
Example 1: Projection onto a Line
We compute the projection of the vector $b = (2, 3)^\top$ onto the line spanned by the vector $u = (1, 1)^\top$. The projection $p$ is given by the formula: $$ p = \frac{u^\top b}{u^\top u} u = \frac{(1)(2) + (1)(3)}{(1)(1) + (1)(1)} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{5}{2} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 2.5 \\ 2.5 \end{pmatrix} $$ The residual vector is $r = b - p = (2 - 2.5, 3 - 2.5)^\top = (-0.5, 0.5)^\top$. We verify that the residual is orthogonal to the line: $u^\top r = (1)(-0.5) + (1)(0.5) = 0$.
8. The Method of Least Squares
The Problem: Overdetermined Systems
In practice, we often encounter linear systems $Ax=b$ with no exact solution. This typically occurs when $m > n$ (more equations than unknowns) and $b \notin \mathcal{R}(A)$. The "least squares" approach finds the best approximate solution by minimizing the sum of squared errors—the squared Euclidean norm of the residual $r = Ax-b$. This formulates the canonical unconstrained optimization problem:
$$ \min_{x \in \mathbb{R}^n} \|Ax - b\|_2^2 $$

Forward Connection: Optimization Formulation
This problem is an instance of Unconstrained Convex Optimization. It minimizes a convex quadratic loss function. The solution methods we derive here (analytical gradients) generalize to the iterative descent methods (Gradient Descent, Newton's Method) covered in Lecture 13. Furthermore, it is a specific case of Quadratic Programming (QP) (Lecture 07), which adds linear constraints to the problem.
Geometric Interpretation: Projection
The solution to the least squares problem has a clean geometric interpretation. The vector $Ax$ lies in the column space of $A$, denoted $\mathcal{R}(A)$. The problem $\min_x \|Ax - b\|_2$ is therefore equivalent to finding the point $p \in \mathcal{R}(A)$ that is closest to $b$ in Euclidean distance.
The 1D Case: Projection onto a Vector
Before solving the general case, consider projecting $x$ onto the line spanned by a vector $y \neq 0$. We want to find $t$ to minimize $\|x - ty\|_2^2$.
$$ \phi(t) = \|x - ty\|_2^2 = \|x\|_2^2 - 2t(x^\top y) + t^2 \|y\|_2^2 $$
Taking the derivative with respect to $t$:
$$ \phi'(t) = -2(x^\top y) + 2t \|y\|_2^2 $$
Setting $\phi'(t) = 0$ gives the optimal scaling:
$$ t^\star = \frac{x^\top y}{\|y\|_2^2} $$
The projection is $p = t^\star y$.
Orthogonality Check: Let the residual be $r = x - p$. Then $y^\top r = y^\top x - t^\star y^\top y = y^\top x - \frac{x^\top y}{\|y\|^2} \|y\|^2 = 0$. The residual is orthogonal to $y$. This intuition generalizes to subspaces.
By the Pythagorean theorem, the closest point $p$ must satisfy a specific geometric condition: the error vector $b - p$ must be orthogonal to the subspace $\mathcal{R}(A)$. Think of dropping a perpendicular: The shortest path from a point to a plane (subspace) is the straight line perpendicular to that plane. If the error vector were not orthogonal, we could project it onto the subspace to find a "correction" that brings us closer to $b$, contradicting the optimality of $p$. Thus, the optimal $p = Ax^\star$ is the orthogonal projection of $b$ onto $\mathcal{R}(A)$.
The Normal Equations
The orthogonality condition, $(b - Ax^\star) \perp \mathcal{R}(A)$, implies that the residual vector must be orthogonal to every basis vector of the subspace. Since the columns of $A$ span $\mathcal{R}(A)$, the residual must be orthogonal to each column of $A$. This can be expressed compactly as: $$ A^\top (b - Ax^\star) = 0 $$ Expanding this yields the normal equations: $$ A^\top A x^\star = A^\top b $$ If the matrix $A$ has linearly independent columns (full column rank), then $A^\top A$ is invertible, and we can write the unique solution as $x^\star = (A^\top A)^{-1} A^\top b$.
Uniqueness of the Solution
- If $A$ has full column rank ($\mathrm{rank}(A) = n$), the solution $x^\star$ is unique.
- If $A$ is rank-deficient ($\mathrm{rank}(A) < n$), there are infinitely many solutions. In this case, the pseudoinverse is used to find the solution with the minimum norm.
Example 2: Solving Least Squares with Normal Equations
Consider the least squares problem for the system $Ax=b$ with: $$ A = \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ 1 & 1 \end{pmatrix}, \quad b = \begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix} $$ First, we form the normal equations $A^\top A x = A^\top b$: $$ A^\top A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & -1 \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix} $$ $$ A^\top b = \begin{pmatrix} 1 & 1 & 1 \\ 1 & -1 & 1 \end{pmatrix} \begin{pmatrix} 2 \\ 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 3 \end{pmatrix} $$ We solve the system $\begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix} x = \begin{pmatrix} 3 \\ 3 \end{pmatrix}$, which yields the solution $x^\star = (3/4, 3/4)^\top$. The projection is $p = Ax^\star = (3/2, 0, 3/2)^\top$, and the residual is $r = b - p = (1/2, 0, -1/2)^\top$. We can verify the orthogonality condition: $A^\top r = (0, 0)^\top$.
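Example 2 can be reproduced in a few lines; as a sketch, the hand computation is cross-checked against NumPy's built-in least-squares solver:

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
b = np.array([2.0, 0.0, 1.0])

# Normal equations: A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_star, [0.75, 0.75])

p = A @ x_star                    # projection of b onto R(A)
r = b - p                         # residual
assert np.allclose(p, [1.5, 0.0, 1.5])
assert np.allclose(A.T @ r, 0)    # orthogonality condition

# Cross-check against the library solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_lstsq, x_star)
```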
Example 3: Rank-Deficient Case
Consider the system with: $$ A = \begin{pmatrix} 1 & 1 \\ 2 & 2 \end{pmatrix}, \quad b = \begin{pmatrix} 3 \\ 6 \end{pmatrix} $$ The columns of $A$ are linearly dependent, so the system is rank-deficient. The vector $b$ is in the column space of $A$, so there are infinitely many solutions. The normal equations are $A^\top A x = \begin{pmatrix} 5 & 5 \\ 5 & 5 \end{pmatrix} x = \begin{pmatrix} 15 \\ 15 \end{pmatrix}$. This system has infinitely many solutions of the form $x_1 + x_2 = 3$. The pseudoinverse would provide the minimum-norm solution, $x_1 = x_2 = 1.5$.
8.4 Variants: Weighted and Constrained Least Squares
(a) Weighted Least Squares (WLS)
Given SPD weight $W \in \mathbb{R}^{m \times m}$: $$ \min_x \|Ax - b\|_W^2 := (Ax - b)^\top W(Ax - b) $$ Let $C$ be a matrix such that $W = C^\top C$ (e.g., via Cholesky decomposition). Then the problem is equivalent to ordinary least squares in the whitened system, minimizing $\|C(Ax - b)\|_2^2 = \|CAx - Cb\|_2^2$. The normal equations are $A^\top W A x = A^\top W b$.
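The equivalence between the whitened ordinary least squares and the weighted normal equations can be checked directly. A minimal NumPy sketch (a diagonal SPD weight for simplicity; NumPy's Cholesky returns lower-triangular $L$ with $W = LL^\top$, so $C = L^\top$):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
W = np.diag(rng.uniform(0.5, 2.0, size=8))   # SPD weight matrix

# Whitening: W = C^T C with C = L^T from the Cholesky factorization W = L L^T.
L = np.linalg.cholesky(W)
C = L.T
x_white, *_ = np.linalg.lstsq(C @ A, C @ b, rcond=None)

# Weighted normal equations: A^T W A x = A^T W b
x_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
assert np.allclose(x_white, x_wls)
```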
(b) Constrained to an Affine Set
Solve: $$ \min_x \|Ax - b\|_2^2 \quad \text{s.t.} \quad Fx = g $$ One method: parametrize $x = x_0 + Zy$, where $Fx_0 = g$ and columns of $Z$ form a basis for $\mathcal{N}(F)$. Then minimize $\|A(x_0 + Zy) - b\|_2^2$ over $y$ (an unconstrained LS). QR on $AZ$ is typically best.
9. Review & Cheat Sheet
This section condenses the lecture into a quick-reference format for definitions, properties, and standard results.
Definitions
| Concept | Definition | Key Properties |
|---|---|---|
| Subspace | Set $S$ closed under addition and scalar multiplication | Contains $\mathbf{0}$. Intersection of subspaces is a subspace. |
| Inner Product | $\langle x, y \rangle$ | Bilinear, Symmetric, Positive Definite. Enables angles and projection. |
| Norm | $\|x\|$ | Positivity, Homogeneity, Triangle Inequality. |
| PSD Matrix | $A \succeq 0$ | $x^\top A x \ge 0 \ \forall x$. Eigenvalues $\lambda_i \ge 0$. |
| Orthogonal Matrix | $Q^\top Q = I$ | Preserves norms/angles. Columns are orthonormal. |
Key Theorems
- Rank-Nullity: $\dim \mathcal{R}(A) + \dim \mathcal{N}(A) = n$ (for $A \in \mathbb{R}^{m \times n}$).
- Spectral Theorem: Real symmetric matrices have real eigenvalues and orthogonal eigenvectors.
- Schur Complement: If $D \succ 0$, then $M \succeq 0 \iff A - BD^{-1}B^\top \succeq 0$.
- Fundamental Theorem of Linear Algebra: $\mathcal{R}(A) \perp \mathcal{N}(A^\top)$ and $\mathcal{R}(A^\top) \perp \mathcal{N}(A)$.
Standard Formulas
- Normal Equations: $A^\top A x = A^\top b$
- Projection onto Subspace: $P = A(A^\top A)^{-1}A^\top$ (if $A$ full rank)
- Cauchy-Schwarz: $|x^\top y| \le \|x\|_2 \|y\|_2$
- Gradient of Quadratic: $\nabla (x^\top A x) = (A+A^\top)x$
11. Exercises
Recap & Key Concepts
These exercises reinforce the foundational tools of linear algebra used in optimization. Focus on the geometry of subspaces, the calculus of gradients (crucial for finding optimality conditions), and the properties of PSD matrices (essential for convexity).
P0.1 — Linear Independence
Determine whether the following sets of vectors are linearly independent. If dependent, exhibit a linear combination summing to zero.
- $v_1 = (1, 2, 3)^\top, v_2 = (4, 5, 6)^\top, v_3 = (7, 8, 9)^\top$.
- $v_1 = (1, 0, 0)^\top, v_2 = (1, 1, 0)^\top, v_3 = (1, 1, 1)^\top$.
- The columns of an upper triangular matrix with non-zero diagonal entries.
Solution
- Dependent. Notice that the vectors are in an arithmetic progression. $v_2 - v_1 = (3, 3, 3)^\top$ and $v_3 - v_2 = (3, 3, 3)^\top$. Thus $v_2 - v_1 = v_3 - v_2$, which rearranges to $v_1 - 2v_2 + v_3 = 0$. This is a non-zero linear combination summing to zero.
- Independent. Form the matrix $A = [v_1, v_2, v_3]$. It is upper triangular with non-zero diagonal entries (all 1s). The determinant is the product of the diagonal entries, which is 1. Since $\det(A) \neq 0$, the columns are linearly independent.
- Independent. An upper triangular matrix $U$ with non-zero diagonal entries has determinant $\prod u_{ii} \neq 0$. Thus, its columns form a basis and are linearly independent.
Recap
Definition: A set of vectors $\{v_1, \dots, v_k\}$ is linearly independent if the only linear combination summing to zero is the trivial one ($\sum c_i v_i = 0 \implies c_i = 0 \ \forall i$).
Matrix View: The matrix $A = [v_1 \dots v_k]$ formed by these vectors has full column rank ($\mathrm{rank}(A) = k$) if and only if $\mathcal{N}(A) = \{0\}$.
Determinant Test (Square Case): For $n$ vectors in $\mathbb{R}^n$, they form a basis (independent and spanning) if and only if $\det([v_1 \dots v_n]) \neq 0$.
Geometric Intuition: Linearly dependent vectors are "redundant"; one can be written as a combination of the others, collapsing the span into a lower-dimensional subspace.
P0.2 — The Rank-Nullity Theorem
Let $A$ be a $10 \times 15$ matrix.
- What is the maximum possible rank of $A$?
- If the rank of $A$ is 8, what is the dimension of the nullspace $\mathcal{N}(A)$?
- If $A x = 0$ has only the solution $x=0$, is this possible? Explain.
Solution
- The rank is bounded by the dimensions: $\mathrm{rank}(A) \le \min(m, n) = \min(10, 15) = 10$.
- By the Rank-Nullity Theorem, $\dim(\mathcal{N}(A)) + \mathrm{rank}(A) = n$. Here $n=15$ (number of columns). So $\dim(\mathcal{N}(A)) = 15 - 8 = 7$.
- The condition "only the solution $x=0$" means $\mathcal{N}(A) = \{0\}$, so $\dim(\mathcal{N}(A)) = 0$. By Rank-Nullity, this would imply $\mathrm{rank}(A) = 15 - 0 = 15$. However, we established in (a) that the maximum rank is 10. Thus, this is impossible. An underdetermined system ($m < n$) always has a non-zero nullspace.
Recap
Rank-Nullity Theorem: For any matrix $A \in \mathbb{R}^{m \times n}$, the dimension of the domain splits into the nullspace and the row space:
$$ \dim \mathcal{R}(A) + \dim \mathcal{N}(A) = n $$
Rank Inequalities: $\mathrm{rank}(A) \le \min(m, n)$. Also, $\mathrm{rank}(AB) \le \min(\mathrm{rank}(A), \mathrm{rank}(B))$ and $\mathrm{rank}(A+B) \le \mathrm{rank}(A) + \mathrm{rank}(B)$.
Full Rank Conditions:
- Full Column Rank ($m \ge n$): $\mathrm{rank}(A) = n \iff \mathcal{N}(A) = \{0\} \iff A^\top A$ is invertible.
- Full Row Rank ($m \le n$): $\mathrm{rank}(A) = m \iff \mathcal{R}(A) = \mathbb{R}^m \iff AA^\top$ is invertible.
P0.3 — Orthogonal Complements
Let $S$ be a subspace of $\mathbb{R}^n$. The orthogonal complement is $S^\perp = \{y \mid y^\top x = 0 \ \forall x \in S\}$.
- Prove that $S^\perp$ is a subspace.
- Prove that $S \cap S^\perp = \{0\}$.
- If $S = \text{span}((1, 0, 0)^\top, (0, 1, 0)^\top)$, calculate $S^\perp$.
Solution
- Subspace: Let $y_1, y_2 \in S^\perp$. Then $y_1^\top x = 0$ and $y_2^\top x = 0$. For any linear combination $\alpha y_1 + \beta y_2$, we have $(\alpha y_1 + \beta y_2)^\top x = \alpha(y_1^\top x) + \beta(y_2^\top x) = 0$. Thus closed under linear combinations. Contains 0 since $0^\top x = 0$.
- Intersection: Let $x \in S \cap S^\perp$. Since $x \in S$ and $x \in S^\perp$, it must be orthogonal to itself: $x^\top x = 0$. $\|x\|^2 = 0 \implies x = 0$. Thus the intersection contains only the zero vector.
- Calculation: $S$ is the $xy$-plane ($z=0$). $S^\perp$ is the set of vectors orthogonal to $(1,0,0)$ and $(0,1,0)$. $y \cdot e_1 = y_1 = 0$. $y \cdot e_2 = y_2 = 0$. Thus $y = (0, 0, y_3)^\top$. $S^\perp$ is the $z$-axis (span of $e_3$).
Recap
Definition: The orthogonal complement $S^\perp$ contains all vectors orthogonal to every vector in the subspace $S$.
Fundamental Properties:
- Decomposition: $\mathbb{R}^n = S \oplus S^\perp$. Any vector $x$ can be uniquely written as $x = x_S + x_{\perp}$, with $x_S \in S$ and $x_{\perp} \in S^\perp$.
- Dimension: $\dim(S) + \dim(S^\perp) = n$.
- Duality: $(S^\perp)^\perp = S$.
Connection to Matrices: $\mathcal{R}(A)^\perp = \mathcal{N}(A^\top)$. (Fundamental Theorem of Linear Algebra).
P0.4 — Norm Equivalence
In finite dimensions, all norms are equivalent. For $x \in \mathbb{R}^n$, prove the following inequalities:
- $\|x\|_\infty \le \|x\|_2 \le \sqrt{n} \|x\|_\infty$
- $\|x\|_2 \le \|x\|_1 \le \sqrt{n} \|x\|_2$
Solution
- Left inequality: $\|x\|_\infty^2 = \max_i |x_i|^2 \le \sum_i x_i^2 = \|x\|_2^2$. Taking square roots gives $\|x\|_\infty \le \|x\|_2$. Right inequality: $\|x\|_2^2 = \sum_i x_i^2 \le \sum_i (\max_j |x_j|)^2 = n \|x\|_\infty^2$. Taking square roots gives $\|x\|_2 \le \sqrt{n}\|x\|_\infty$.
- Left inequality: square $\|x\|_1$: $\|x\|_1^2 = (\sum_i |x_i|)^2 = \sum_i x_i^2 + \sum_{i \ne j} |x_i||x_j| = \|x\|_2^2 + \text{non-negative terms} \ge \|x\|_2^2$. Thus $\|x\|_1 \ge \|x\|_2$. Right inequality: use Cauchy-Schwarz with the vector of ones $\mathbf{1}$ and the vector $|x| = (|x_1|, \dots, |x_n|)$: $$ \|x\|_1 = \sum_i |x_i| \cdot 1 = |x|^\top \mathbf{1} \le \| |x| \|_2 \|\mathbf{1}\|_2 = \sqrt{n}\, \|x\|_2 $$
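All four inequalities can be stress-tested on random vectors. A minimal NumPy sketch (dimension, sample count, and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 10
for _ in range(100):
    x = rng.standard_normal(n)
    linf = np.max(np.abs(x))
    l2 = np.linalg.norm(x)
    l1 = np.sum(np.abs(x))
    # ||x||_inf <= ||x||_2 <= sqrt(n) ||x||_inf
    assert linf <= l2 <= np.sqrt(n) * linf + 1e-12
    # ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2
    assert l2 <= l1 <= np.sqrt(n) * l2 + 1e-12
```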
Recap
Norm Equivalence Theorem: In finite-dimensional vector spaces, all norms are equivalent. For any two norms $\|\cdot\|_a, \|\cdot\|_b$, there exist constants $c, C > 0$ such that $c \|x\|_a \le \|x\|_b \le C \|x\|_a$. This implies they define the same topology (the same sequences converge).
Standard Inequalities:
- $\|x\|_\infty \le \|x\|_2 \le \|x\|_1$ (Hierarchy of norms)
- $\|x\|_2 \le \sqrt{n} \|x\|_\infty$
- $\|x\|_1 \le \sqrt{n} \|x\|_2$ (Cauchy-Schwarz with $\mathbf{1}$)
Geometry:
- $\ell_1$ ball: Cross-polytope (Diamond).
- $\ell_2$ ball: Sphere.
- $\ell_\infty$ ball: Hypercube.
P0.5 — Frobenius Norm Submultiplicativity
We proved $\|AB\|_F \le \|A\|_F \|B\|_F$ in the lecture using Cauchy-Schwarz.
Re-derive this result by writing $\|X\|_F^2 = \mathrm{tr}(X^\top X)$ and using the property $\mathrm{tr}(M) = \sum \lambda_i(M)$ along with the fact that $\lambda_{\max}(A^\top A) \le \mathrm{tr}(A^\top A)$.
Solution
We first establish a lemma: for symmetric $P$ and PSD $Q$, $\mathrm{tr}(PQ) \le \lambda_{\max}(P)\,\mathrm{tr}(Q)$.

Proof of lemma: Since $Q$ is symmetric PSD, it has an eigendecomposition $Q = U \Lambda U^\top$ with $\lambda_i \ge 0$. $$ \mathrm{tr}(PQ) = \mathrm{tr}(P U \Lambda U^\top) = \mathrm{tr}(U^\top P U \Lambda) $$ Let $M = U^\top P U$. The diagonal entry $M_{ii} = u_i^\top P u_i$. Since $u_i$ is a unit vector (column of the orthogonal matrix $U$), the Rayleigh quotient implies: $$ \lambda_{\min}(P) \le u_i^\top P u_i \le \lambda_{\max}(P) $$ Thus $M_{ii} \le \lambda_{\max}(P)$, and $$ \mathrm{tr}(M \Lambda) = \sum_i M_{ii} \lambda_i \le \lambda_{\max}(P) \sum_i \lambda_i = \lambda_{\max}(P)\, \mathrm{tr}(Q) $$

Now apply the lemma with $P = A^\top A$ and $Q = B B^\top$ (both PSD): $$ \|AB\|_F^2 = \mathrm{tr}(B^\top A^\top A B) = \mathrm{tr}\big((A^\top A)(B B^\top)\big) \le \lambda_{\max}(A^\top A)\, \mathrm{tr}(B B^\top) \le \mathrm{tr}(A^\top A)\, \mathrm{tr}(B B^\top) = \|A\|_F^2 \|B\|_F^2 $$ where the last inequality uses $\lambda_{\max}(A^\top A) \le \mathrm{tr}(A^\top A)$, valid because all eigenvalues of $A^\top A$ are non-negative. Taking square roots gives the result.
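Both the final inequality and the intermediate eigenvalue bound can be spot-checked numerically. A minimal NumPy sketch (shapes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 3))

fro = lambda X: np.linalg.norm(X, 'fro')

# Submultiplicativity: ||AB||_F <= ||A||_F ||B||_F
assert fro(A @ B) <= fro(A) * fro(B) + 1e-12

# Intermediate bound from the proof: lam_max(A^T A) <= tr(A^T A) = ||A||_F^2
assert np.linalg.eigvalsh(A.T @ A).max() <= fro(A) ** 2 + 1e-12
```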
Recap
Frobenius Norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}$. It measures the total "energy" of the matrix entries.
Algebraic Property: The Frobenius norm is submultiplicative (or consistent): $\|AB\|_F \le \|A\|_F \|B\|_F$.
Comparison: $\|A\|_2 \le \|A\|_F \le \sqrt{\mathrm{rank}(A)}\, \|A\|_2$. The spectral norm is "tighter".
P0.6 — Spectral Norm Properties
Let $\|A\|_2$ denote the spectral norm (max singular value).
- Show that $\|A\|_2 = 0$ if and only if $A=0$.
- Show that $\|Q A\|_2 = \|A\|_2$ for any orthogonal matrix $Q$. (Orthogonal invariance).
Solution
- $\|A\|_2 = \sigma_{\max}(A)$. Singular values are non-negative. The max is 0 iff all singular values are 0. If $\Sigma=0$ in SVD $A=U\Sigma V^\top$, then $A=0$. Conversely if $A=0$, the maximum stretch is 0.
- $\|QA\|_2 = \sup_{\|x\|=1} \|QAx\|_2$. Since $Q$ is orthogonal, it preserves Euclidean norms: $\|QAx\|_2 = \|Ax\|_2$. Thus $\sup_{\|x\|=1} \|Ax\|_2 = \|A\|_2$. The spectral norm is invariant under left (and right) orthogonal multiplication.
Recap
Definition: The spectral norm (or operator norm) is the maximum amplification of a vector's length: $\|A\|_2 = \sup_{\|x\|_2=1} \|Ax\|_2$.
Calculation: $\|A\|_2 = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^\top A)}$.
Invariance: The spectral norm is orthogonally invariant. $\|Q A Z\|_2 = \|A\|_2$ for any orthogonal matrices $Q, Z$. It depends only on the singular values (intrinsic geometry).
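Orthogonal invariance is easy to confirm numerically, using a random orthogonal matrix obtained from a QR factorization. A minimal NumPy sketch (sizes and seed are arbitrary; `np.linalg.norm(X, 2)` computes the spectral norm of a matrix):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((4, 4))

# Random orthogonal Q from the QR factorization of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
assert np.allclose(Q.T @ Q, np.eye(4))

spec = lambda X: np.linalg.norm(X, 2)   # spectral norm = sigma_max
assert np.isclose(spec(Q @ A), spec(A))  # left-invariance
assert np.isclose(spec(A @ Q), spec(A))  # right-invariance
```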
P0.7 — Isometries
An isometry is a linear map $f(x) = Qx$ that preserves distances: $\|Qx - Qy\|_2 = \|x - y\|_2$ for all $x, y$.
Show that this condition implies $Q^\top Q = I$. Thus, isometries are represented by orthogonal matrices.
Solution

Set $z = x - y$. Distance preservation means $\|Qz\|_2 = \|z\|_2$ for all $z$, i.e. $z^\top Q^\top Q z = z^\top z$. By the polarization identity, a quadratic form determines its bilinear form: for all $u, v$, $$ u^\top Q^\top Q v = \tfrac{1}{4}\left(\|Q(u+v)\|_2^2 - \|Q(u-v)\|_2^2\right) = \tfrac{1}{4}\left(\|u+v\|_2^2 - \|u-v\|_2^2\right) = u^\top v $$ Taking $u = e_i$ and $v = e_j$ shows $(Q^\top Q)_{ij} = \delta_{ij}$, i.e. $Q^\top Q = I$.
Recap
Isometry Definition: A linear map $Q$ is an isometry if $\|Qx\|_2 = \|x\|_2$ for all $x$.
Polarization Identity: Inner products can be recovered solely from norm evaluations:
$$ \langle x, y \rangle = \frac{1}{4} (\|x+y\|^2 - \|x-y\|^2) $$
Conclusion: Preserving lengths is equivalent to preserving angles (inner products). Thus, isometries in Euclidean space are exactly the orthogonal matrices ($Q^\top Q = I$).
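The polarization identity, and its consequence that length-preserving maps preserve inner products, can be verified numerically; the random vectors here are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.standard_normal(5), rng.standard_normal(5)

# Polarization identity: recover <x, y> from norm evaluations alone
lhs = x @ y
rhs = 0.25 * (np.linalg.norm(x + y)**2 - np.linalg.norm(x - y)**2)
assert np.isclose(lhs, rhs)

# An isometry (orthogonal Q) therefore preserves inner products too
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
assert np.isclose((Q @ x) @ (Q @ y), x @ y)
assert np.allclose(Q.T @ Q, np.eye(5))
```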
P0.8 — Weighted Inner Product
Let $A \in \mathbb{S}^n_{++}$ be a symmetric positive definite matrix.
(a) Prove that $\langle x, y \rangle_A = x^\top A y$ satisfies all axioms of an inner product.
(b) Write down the Cauchy-Schwarz inequality for this inner product.
Solution
- Symmetry: $\langle x, y \rangle_A = x^\top A y = (x^\top A y)^\top = y^\top A^\top x = y^\top A x = \langle y, x \rangle_A$ (since $A=A^\top$).
- Linearity: $\langle \alpha x + z, y \rangle_A = \alpha\, x^\top A y + z^\top A y$, since matrix multiplication distributes over addition and commutes with scalars.
- Positive definiteness: $\langle x, x \rangle_A = x^\top A x > 0$ for all $x \neq 0$, since $A \succ 0$.
- Cauchy-Schwarz: $|\langle x, y \rangle_A| \le \sqrt{\langle x, x \rangle_A} \sqrt{\langle y, y \rangle_A}$. Explicitly: $|x^\top A y| \le \sqrt{x^\top A x} \sqrt{y^\top A y}$.
Recap
Weighted Inner Product: Given a Positive Definite matrix $P \succ 0$, the function $\langle x, y \rangle_P = x^\top P y$ defines a valid inner product.
Mahalanobis Distance: The induced norm $\|x\|_P = \sqrt{x^\top P x}$ measures distance accounting for correlations/scales defined by $P$.
Geometric Meaning: The unit ball $\{x \mid x^\top P x \le 1\}$ is an ellipsoid centered at the origin.
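A quick numerical check of the axioms and the weighted Cauchy-Schwarz inequality; the construction $P = MM^\top + nI$ below is just one convenient way to produce a positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
M = rng.standard_normal((n, n))
P = M @ M.T + n * np.eye(n)      # symmetric positive definite by construction

x, y = rng.standard_normal(n), rng.standard_normal(n)
ip = lambda u, v: u @ P @ v      # weighted inner product <u, v>_P

# Cauchy-Schwarz for the weighted inner product
assert abs(ip(x, y)) <= np.sqrt(ip(x, x)) * np.sqrt(ip(y, y)) + 1e-12
# Symmetry and positive definiteness
assert np.isclose(ip(x, y), ip(y, x))
assert ip(x, x) > 0
```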
P0.9 — Characterization of Projectors
A matrix $P$ is an orthogonal projector if and only if it is idempotent ($P^2=P$) and symmetric ($P^\top=P$).
(a) Prove that if $P$ is an orthogonal projector onto a subspace $S$, it satisfies these conditions.
(b) Prove the converse.
Solution
- Let $P$ be the orthogonal projector onto $S$. Let $x = u + v$ where $u \in S, v \in S^\perp$. Then $Px = u$.
Idempotent: $P^2 x = P(Px) = Pu$. Since $u \in S$, $Pu = u$. Thus $P^2 x = u = Px$. So $P^2 = P$.
Symmetric: Check $\langle Px, y \rangle = \langle x, Py \rangle$. Decompose $x = x_S + x_\perp$ and $y = y_S + y_\perp$. Then $\langle Px, y \rangle = \langle x_S, y_S + y_\perp \rangle = \langle x_S, y_S \rangle$ (since $S \perp S^\perp$), and $\langle x, Py \rangle = \langle x_S + x_\perp, y_S \rangle = \langle x_S, y_S \rangle$. They are equal, so $P$ is symmetric.
- Conversely, let $P = P^\top = P^2$ and define $S = \mathcal{R}(P)$. For any $x$, write $x = Px + (I-P)x$ with $Px \in S$. The second component lies in $S^\perp$: every element of $S$ has the form $Pz$, and $(Pz)^\top (I-P)x = z^\top P^\top (I-P) x = z^\top (P - P^2) x = 0$ since $P^2 = P$. Thus each $x$ decomposes into a component in $S$ plus a component orthogonal to $S$, and $P$ maps $x$ to the $S$-component. This is precisely the definition of an orthogonal projector onto $S$.
Recap
Orthogonal Projector Theorem: A matrix $P$ represents an orthogonal projection onto some subspace if and only if:
- $P^2 = P$: It is a projection (idempotent). Applying it twice is the same as applying it once.
- $P^\top = P$: It is orthogonal (symmetric). The residual is orthogonal to the range.
Oblique Projectors: If $P^2=P$ but $P \neq P^\top$, it is an oblique projection (projects along a non-orthogonal angle).
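Both defining properties can be confirmed numerically for a projector built from a basis; the formula $P = A(A^\top A)^{-1}A^\top$ used below is the standard projector onto $\mathcal{R}(A)$, with random data as illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 2))           # columns span a 2D subspace S
P = A @ np.linalg.solve(A.T @ A, A.T)     # orthogonal projector onto R(A)

assert np.allclose(P @ P, P)              # idempotent: P^2 = P
assert np.allclose(P, P.T)                # symmetric: P^T = P

x = rng.standard_normal(5)
r = x - P @ x                             # residual (I - P) x
assert np.allclose(A.T @ r, 0)            # residual is orthogonal to S
```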
P0.10 — Projection onto a Line
Let $a = (1, 1, 1)^\top$. Find the projection of $b = (1, 0, 0)^\top$ onto the line spanned by $a$. Verify that the residual $b - p$ is orthogonal to $a$.
Solution
Projection formula: $p = \frac{a^\top b}{a^\top a} a$.
$a^\top b = 1 \cdot 1 + 1 \cdot 0 + 1 \cdot 0 = 1$.
$a^\top a = 1^2 + 1^2 + 1^2 = 3$.
$p = \frac{1}{3} (1, 1, 1)^\top = (1/3, 1/3, 1/3)^\top$.
Residual $r = b - p = (1-1/3, 0-1/3, 0-1/3)^\top = (2/3, -1/3, -1/3)^\top$.
Check orthogonality: $a^\top r = 1(2/3) + 1(-1/3) + 1(-1/3) = 0$. Verified.
Recap
Vector Projection: The projection of a vector $b$ onto the line spanned by a non-zero vector $a$ is given by:
$$ \Pi_a(b) = \frac{\langle a, b \rangle}{\|a\|^2} a $$
Geometric Intuition: This scales $a$ by the "shadow" of $b$ onto $a$.
Projection Matrix: $P = \frac{a a^\top}{a^\top a}$ is a rank-1 orthogonal projector.
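The numbers from P0.10 can be reproduced directly with the projection formula and the rank-1 projector; a short sketch:

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0])
b = np.array([1.0, 0.0, 0.0])

p = (a @ b) / (a @ a) * a                 # projection of b onto span{a}
assert np.allclose(p, [1/3, 1/3, 1/3])
assert np.isclose(a @ (b - p), 0)         # residual orthogonal to a

P = np.outer(a, a) / (a @ a)              # rank-1 projector a a^T / a^T a
assert np.allclose(P @ b, p)
assert np.linalg.matrix_rank(P) == 1
```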
P0.11 — Projection onto a Hyperplane
Find the projection of the point $y = (3, 3)^\top$ onto the line (hyperplane in 2D) defined by $x_1 + x_2 = 2$.
Solution
The line is $a^\top x = b$ with $a=(1, 1)^\top, b=2$.
The formula for projection onto a hyperplane is $p = y - \frac{a^\top y - b}{\|a\|^2} a$.
$a^\top y = 3+3=6$.
$\|a\|^2 = 2$.
$p = (3, 3)^\top - \frac{6 - 2}{2} (1, 1)^\top = (3, 3)^\top - 2(1, 1)^\top = (3, 3)^\top - (2, 2)^\top = (1, 1)^\top$.
Check: $1+1=2$. Point is on the line. Residual $(2, 2)$ is parallel to normal $(1, 1)$. Correct.
Recap
Hyperplane Definition: A hyperplane is an affine set defined by a single linear equality constraint: $\mathcal{H} = \{x \mid a^\top x = b\}$.
Projection Formula: To project a point $y$ onto $\mathcal{H}$, you move in the direction of the normal vector $a$ (the most direct path) by exactly the amount needed to satisfy the equation:
$$ \Pi_{\mathcal{H}}(y) = y - \frac{a^\top y - b}{\|a\|^2} a $$
Signed Distance: The quantity $\frac{a^\top y - b}{\|a\|}$ is the signed distance from $y$ to the hyperplane.
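The hyperplane projection formula and the signed-distance interpretation can be checked against the P0.11 example:

```python
import numpy as np

a, b = np.array([1.0, 1.0]), 2.0          # hyperplane a^T x = b
y = np.array([3.0, 3.0])

p = y - (a @ y - b) / (a @ a) * a         # projection onto the hyperplane
assert np.allclose(p, [1.0, 1.0])
assert np.isclose(a @ p, b)               # p lies on the hyperplane

# Signed distance (a^T y - b) / ||a|| equals the length of the step taken
signed_dist = (a @ y - b) / np.linalg.norm(a)
assert np.isclose(signed_dist, np.linalg.norm(y - p))
```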
P0.12 — Trace and Determinant
- Show that $\mathrm{tr}(ABC) = \mathrm{tr}(BCA)$ but generally $\mathrm{tr}(ABC) \neq \mathrm{tr}(BAC)$. Construct a $2 \times 2$ counterexample.
- Let $A \in \mathbb{R}^{n \times n}$ be skew-symmetric ($A^\top = -A$). Show that $x^\top A x = 0$ for all $x$.
- Use the result from (b) to prove that if $n$ is odd, $\det(A) = 0$. (Hint: $\det(A^\top) = \det(-A)$).
Solution
- Cyclic Property: Let $X = AB$. Then $\mathrm{tr}(XC) = \mathrm{tr}(CX)$ (basic cyclic property). Substituting $X=AB$, we get $\mathrm{tr}((AB)C) = \mathrm{tr}(C(AB)) = \mathrm{tr}(CAB)$. Applying it again gives $\mathrm{tr}(BCA)$.
Counterexample for Non-Cyclic: Let $A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$, $B = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}$, $C = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$.
$AB = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$, so $ABC = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \implies \mathrm{tr}=1$.
$BAC = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = \mathbf{0} \implies \mathrm{tr}=0$.
- Skew-Symmetric Form: The scalar $x^\top A x$ is its own transpose:
$$ x^\top A x = (x^\top A x)^\top = x^\top A^\top x. $$
Since $A^\top = -A$, we have $x^\top A x = x^\top (-A) x = -x^\top A x$. The only number equal to its negative is 0. Thus $x^\top A x = 0$.
- Determinant of Skew-Symmetric: $$ \det(A) = \det(A^\top) = \det(-A) = (-1)^n \det(A) $$ If $n$ is odd, $(-1)^n = -1$. So $\det(A) = -\det(A)$, which implies $2\det(A) = 0 \implies \det(A) = 0$.
Recap
Trace Properties:
- Linearity: $\mathrm{tr}(\alpha A + \beta B) = \alpha \mathrm{tr}(A) + \beta \mathrm{tr}(B)$.
- Cyclic Invariance: $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$. Note: $\mathrm{tr}(ABC) \neq \mathrm{tr}(BAC)$ in general.
Determinant Properties:
- Multiplicativity: $\det(AB) = \det(A)\det(B)$.
- Transpose: $\det(A^\top) = \det(A)$.
Skew-Symmetric Matrices ($A^\top = -A$):
- Quadratic Form: $x^\top A x = 0$ for all $x \in \mathbb{R}^n$.
- Eigenvalues: Purely imaginary or zero. If $n$ is odd, at least one eigenvalue is 0 ($\det(A)=0$).
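All three parts of P0.12 can be verified numerically, reusing the counterexample matrices and a random skew-symmetric matrix of odd dimension:

```python
import numpy as np

A = np.array([[0., 1.], [0., 0.]])
B = np.array([[0., 0.], [1., 0.]])
C = np.array([[1., 0.], [0., 0.]])

# Cyclic invariance holds, but arbitrary permutation does not (1 vs 0)
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
assert not np.isclose(np.trace(A @ B @ C), np.trace(B @ A @ C))

# Skew-symmetric: x^T K x = 0, and det(K) = 0 for odd n
rng = np.random.default_rng(5)
M = rng.standard_normal((3, 3))
K = M - M.T                               # skew-symmetric, n = 3 (odd)
x = rng.standard_normal(3)
assert np.isclose(x @ K @ x, 0)
assert np.isclose(np.linalg.det(K), 0)
```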
P0.13 — Matrix Calculus Practice
Compute the gradient with respect to $X \in \mathbb{R}^{n \times n}$ for the following functions:
- $f(X) = \mathrm{tr}(A X B)$, where $A, B$ are constant matrices.
- $f(X) = \mathrm{tr}(X^\top X)$.
- $f(X) = a^\top X b$, where $a, b$ are constant vectors.
Solution
- Use the cyclic property: $\mathrm{tr}(AXB) = \mathrm{tr}(BA X)$. The gradient of $\mathrm{tr}(M X)$ is $M^\top$. Here $M = BA$. Thus $\nabla_X f(X) = (BA)^\top = A^\top B^\top$.
- $f(X) = \|X\|_F^2 = \sum_{ij} X_{ij}^2$. $\frac{\partial f}{\partial X_{ij}} = 2 X_{ij}$. Thus $\nabla_X f(X) = 2X$.
- $a^\top X b = \mathrm{tr}(a^\top X b) = \mathrm{tr}(b a^\top X)$. Using the rule from (a) with $M = b a^\top$: $\nabla_X f(X) = (b a^\top)^\top = (a^\top)^\top b^\top = a b^\top$. (This is an outer product matrix).
Recap
Matrix Calculus Toolkit:
Definition: The gradient $\nabla_X f(X)$ is the matrix $G$ such that the first-order approximation is $f(X+H) \approx f(X) + \langle G, H \rangle = f(X) + \mathrm{tr}(G^\top H)$.
Key Formulas:
- $\nabla_X \mathrm{tr}(A X) = A^\top$.
- $\nabla_X \mathrm{tr}(X^\top A X) = (A + A^\top)X$. (For symmetric $A$, $2AX$).
- $\nabla_X \|X\|_F^2 = 2X$.
Strategy: Perturb $X \to X + H$, expand terms linear in $H$, and rearrange trace terms to match the form $\mathrm{tr}(G^\top H)$.
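The gradient formulas from P0.13 can be validated with a finite-difference check of the definition $f(X+H) \approx f(X) + \mathrm{tr}(G^\top H)$; the helper `dir_deriv` is a name invented here for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
X, H = rng.standard_normal((n, n)), rng.standard_normal((n, n))

def dir_deriv(f, X, H, eps=1e-6):
    """Central-difference directional derivative of f at X along H."""
    return (f(X + eps * H) - f(X - eps * H)) / (2 * eps)

# (a) grad tr(AXB) = A^T B^T : check <G, H> against the directional derivative
G = A.T @ B.T
assert np.isclose(dir_deriv(lambda X: np.trace(A @ X @ B), X, H),
                  np.trace(G.T @ H), atol=1e-4)

# (b) grad tr(X^T X) = 2X
assert np.isclose(dir_deriv(lambda X: np.trace(X.T @ X), X, H),
                  np.trace((2 * X).T @ H), atol=1e-4)
```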
P0.14 — Testing Positive Semidefiniteness
Determine whether the following matrices are PSD, PD, Indefinite, or Negative Definite/Semidefinite.
- $A = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}$
- $B = \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$
- $C = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 3 \end{bmatrix}$
Solution
- Positive Definite. Using Sylvester's Criterion (Leading Principal Minors): $D_1 = 2 > 0$, $D_2 = \det(A) = 4 - 1 = 3 > 0$. Since all leading principal minors are positive, $A \succ 0$. Alternatively, eigenvalues are $\lambda = 1, 3$.
- Indefinite. The determinant is $\det(B) = 1 - 4 = -3 < 0$. Since the product of eigenvalues is negative, they must have opposite signs ($\lambda = 3, -1$). Thus, $B$ is indefinite.
- Positive Semidefinite. This is a diagonal matrix with entries $1, 0, 3$, which are the eigenvalues. Since all $\lambda_i \ge 0$ and one is zero, $C \succeq 0$ but is not positive definite.
Recap
Characterizing Definiteness:
- Positive Definite (PD): All $\lambda_i > 0$. Strictly convex bowl.
- Positive Semidefinite (PSD): All $\lambda_i \ge 0$. Convex bowl (possibly flat).
- Indefinite: Some $\lambda_i > 0$, some $\lambda_i < 0$. Saddle point geometry.
- Negative Definite (ND): All $\lambda_i < 0$. Concave cap.
Practical Checks:
- Check determinant (product of eigenvalues) and trace (sum of eigenvalues).
- If $\det(A) < 0$, the product of the eigenvalues is negative, so an odd number of them are negative; in particular, $A$ has at least one negative eigenvalue and cannot be PSD. For $n = 2$, exactly one eigenvalue is negative, so $A$ is indefinite.
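The eigenvalue characterization can be wrapped in a small classifier and tested on the matrices from P0.14; `classify` is a helper name invented here:

```python
import numpy as np

def classify(S, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    lam = np.linalg.eigvalsh(S)
    if np.all(lam > tol):   return "PD"
    if np.all(lam >= -tol): return "PSD"
    if np.all(lam < -tol):  return "ND"
    if np.all(lam <= tol):  return "NSD"
    return "indefinite"

A = np.array([[2., -1.], [-1., 2.]])
B = np.array([[1., 2.], [2., 1.]])
C = np.diag([1., 0., 3.])
assert classify(A) == "PD"          # eigenvalues 1, 3
assert classify(B) == "indefinite"  # eigenvalues 3, -1
assert classify(C) == "PSD"         # eigenvalues 1, 0, 3
```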
P0.15 — Schur Complement Application
Use the Schur Complement condition (assuming the top-left pivot is positive) to determine the range of $x$ for which the matrix $M(x)$ is Positive Semidefinite:
$$ M(x) = \begin{bmatrix} x & 1 \\ 1 & x \end{bmatrix} $$
Verify your answer by computing the eigenvalues directly.
Solution
Schur Complement Method: For $M \succeq 0$ with a positive pivot, we need the top-left block $A = x > 0$ and the Schur complement $S = D - C A^{-1} B \ge 0$. Here $A = x$, $B = C = 1$, $D = x$, so
$$ S = x - \frac{1}{x}. $$
We need $x > 0$ and $x - 1/x \ge 0$, which (since $x > 0$) is equivalent to $x^2 \ge 1$. Thus the range is $x \ge 1$.
Eigenvalue Verification: The characteristic equation is $(x - \lambda)^2 - 1 = 0$, so $x - \lambda = \pm 1$ and $\lambda = x \pm 1$. For PSD we need $\lambda_{\min} = x - 1 \ge 0$, i.e. $x \ge 1$; this automatically gives $\lambda_{\max} = x + 1 \ge 2 \ge 0$. The two methods agree.
Recap
Schur Complement Lemma: A block matrix $M = \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}$ (with $A \succ 0$) is PSD if and only if the Schur complement $S = C - B^\top A^{-1} B$ is PSD.
LMI (Linear Matrix Inequality): A constraint of the form $F(x) \succeq 0$ where $F$ is affine in $x$.
Trick: Use Schur complements to convert nonlinear constraints (like $x^2 \le y$ or quadratic-over-linear forms) into equivalent Linear Matrix Inequalities. This allows them to be solved using Semidefinite Programming (SDP).
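The agreement between the Schur-complement test and the direct eigenvalue test for $M(x)$ can be checked at a few sample points; `psd_via_schur` is a helper name invented here:

```python
import numpy as np

def psd_via_schur(x):
    """PSD test for [[x, 1], [1, x]] via the Schur complement, assuming pivot x > 0."""
    return x > 0 and x - 1.0 / x >= 0

for x in [0.5, 1.0, 2.0, 3.0]:
    M = np.array([[x, 1.0], [1.0, x]])
    psd_eig = np.min(np.linalg.eigvalsh(M)) >= -1e-12   # direct eigenvalue test
    assert psd_via_schur(x) == psd_eig                   # the two tests agree
```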
P0.16 — Loewner Order Transitivity
The Loewner order is defined as $X \succeq Y \iff X - Y \succeq 0$. Prove that this order is transitive: if $X \succeq Y$ and $Y \succeq Z$, then $X \succeq Z$. Use the variational definition of PSD matrices.
Solution
Let $v \in \mathbb{R}^n$ be any vector. From $X \succeq Y$, $v^\top (X-Y) v \ge 0$, i.e. $v^\top X v \ge v^\top Y v$; similarly, $Y \succeq Z$ gives $v^\top Y v \ge v^\top Z v$. Combining the inequalities: $v^\top X v \ge v^\top Y v \ge v^\top Z v$. Thus $v^\top (X-Z) v \ge 0$. Since this holds for all $v$, $X - Z \succeq 0$, so $X \succeq Z$.
Recap
Loewner Order ($\succeq$): A partial ordering on the set of symmetric matrices defined by $X \succeq Y \iff X - Y \succeq 0$ (i.e., $X-Y$ is PSD).
Key Properties:
- Transitivity: $X \succeq Y$ and $Y \succeq Z \implies X \succeq Z$.
- Additivity: $X \succeq Y \implies X+Z \succeq Y+Z$.
- Congruence: $X \succeq Y \implies T^\top X T \succeq T^\top Y T$ for any $T$.
- Inversion: If $X, Y \succ 0$, then $X \succeq Y \implies Y^{-1} \succeq X^{-1}$ (order reversal).
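Several of these properties can be spot-checked numerically; the particular matrices below (with $X - Y = I$, so $X \succeq Y$ by construction) are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 3
M = rng.standard_normal((n, n))
Y = M @ M.T + np.eye(n)              # Y > 0
X = Y + np.eye(n)                    # X >= Y, since X - Y = I is PSD

is_psd = lambda S: np.min(np.linalg.eigvalsh(S)) >= -1e-10

assert is_psd(X - Y)                                   # X >= Y
# Congruence: T^T X T >= T^T Y T for any T
T = rng.standard_normal((n, n))
assert is_psd(T.T @ X @ T - T.T @ Y @ T)
# Order reversal under inversion: Y^{-1} >= X^{-1}
assert is_psd(np.linalg.inv(Y) - np.linalg.inv(X))
```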
P0.17 — PSD Cone in 2D
Consider the space of $2 \times 2$ symmetric matrices, which has dimension 3.
Write down the explicit inequalities defining the PSD cone $S \succeq 0$ in terms of the matrix entries $x, y, z$ (where $S = \begin{bmatrix} x & y \\ y & z \end{bmatrix}$). Relate this to the trace and determinant conditions.
Solution
A $2 \times 2$ matrix is PSD if and only if:
1. Trace $\ge 0$: $x + z \ge 0$.
2. Determinant $\ge 0$: $xz - y^2 \ge 0$.
Note that $xz \ge y^2 \ge 0$ rules out $x$ and $z$ having opposite signs.
Combined with $x + z \ge 0$, both must be non-negative: $x \ge 0$, $z \ge 0$.
Thus the conditions are:
$$ x \ge 0, \quad z \ge 0, \quad xz \ge y^2 $$
Geometrically, this is a cone in $\mathbb{R}^3$. The boundary $xz=y^2$ is a rotated quadratic cone surface.
Recap
The PSD Cone $\mathbb{S}^n_+$: The set of all symmetric positive semidefinite matrices forms a convex cone in $\mathbb{S}^n$.
Geometry (2x2): In the 3D space of entries $(x, y, z)$, the PSD cone is defined by $x \ge 0, z \ge 0, xz \ge y^2$. This looks like an "ice cream cone" with a non-circular cross-section.
Boundary: The boundary of the cone consists of singular PSD matrices (rank deficient, $\det(X)=0$).
Interior: The interior consists of Positive Definite (PD) matrices ($\det(X)>0, x>0$).
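The entrywise description of the cone can be checked against the eigenvalue test on random $2 \times 2$ symmetric matrices; `in_cone` is a helper name invented here:

```python
import numpy as np

def in_cone(x, y, z, tol=1e-12):
    """Entrywise conditions for [[x, y], [y, z]] to be PSD."""
    return x >= -tol and z >= -tol and x * z - y * y >= -tol

rng = np.random.default_rng(8)
for _ in range(200):
    x, y, z = rng.uniform(-2, 2, size=3)
    S = np.array([[x, y], [y, z]])
    psd = np.min(np.linalg.eigvalsh(S)) >= -1e-10   # eigenvalue test
    assert in_cone(x, y, z) == psd                   # descriptions agree
```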
P0.18 — Hessian of a Cubic
Let $f: \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x) = x_1^3 + x_2^3 + 2x_1 x_2$.
- Compute the gradient $\nabla f(x)$.
- Compute the Hessian matrix $\nabla^2 f(x)$.
- For which $x$ is the Hessian Positive Semidefinite? (This identifies the region where the function is locally convex).
Solution
- Gradient: $$ \frac{\partial f}{\partial x_1} = 3x_1^2 + 2x_2, \quad \frac{\partial f}{\partial x_2} = 3x_2^2 + 2x_1 $$ $\nabla f(x) = \begin{bmatrix} 3x_1^2 + 2x_2 \\ 3x_2^2 + 2x_1 \end{bmatrix}$.
- Hessian: $$ \nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 6x_1 & 2 \\ 2 & 6x_2 \end{bmatrix} $$
- PSD Condition: A $2 \times 2$ symmetric matrix is PSD iff trace $\ge 0$ and determinant $\ge 0$.
Trace: $6x_1 + 6x_2 \ge 0 \implies x_1 + x_2 \ge 0$.
Determinant: $36x_1 x_2 - 4 \ge 0 \implies 9 x_1 x_2 \ge 1$.
The condition $x_1 x_2 \ge 1/9$ forces $x_1$ and $x_2$ to have the same sign; if both were negative, $x_1 + x_2 < 0$ would violate the trace condition. Thus we need $x_1 > 0$, $x_2 > 0$, and $x_1 x_2 \ge 1/9$. This region, above the hyperbola $x_1 x_2 = 1/9$ in the first quadrant, is where the function is locally convex.
Recap
Convexity Condition: A twice-differentiable function is convex on a domain iff its Hessian matrix is Positive Semidefinite ($\nabla^2 f(x) \succeq 0$) everywhere in that domain.
Sylvester's Criterion (for $A \succ 0$): A symmetric matrix is Positive Definite iff all leading principal minors (determinants of top-left $k \times k$ submatrices) are positive.
2x2 PSD Test: $\begin{bmatrix} a & b \\ b & c \end{bmatrix} \succeq 0 \iff a \ge 0, c \ge 0, ac - b^2 \ge 0$. (Trace $\ge 0$ and Det $\ge 0$).
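The derived convexity region for P0.18 can be verified against the eigenvalue test at a few sample points; `hessian` and `in_region` are helper names invented here:

```python
import numpy as np

def hessian(x1, x2):
    """Hessian of f(x) = x1^3 + x2^3 + 2 x1 x2."""
    return np.array([[6 * x1, 2.0], [2.0, 6 * x2]])

def in_region(x1, x2):
    """Derived convexity region: x1 > 0, x2 > 0, x1 x2 >= 1/9."""
    return x1 > 0 and x2 > 0 and x1 * x2 >= 1.0 / 9.0

for pt in [(1.0, 1.0), (0.5, 0.5), (1.0, 0.05), (-1.0, -1.0)]:
    psd = np.min(np.linalg.eigvalsh(hessian(*pt))) >= -1e-12
    assert psd == in_region(*pt)
```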
P0.19 — Least Squares from Scratch
Consider the function $f(x) = \frac{1}{2} \|Ax - b\|_2^2$.
- Expand the squared norm into terms involving $x^\top A^\top A x$, etc.
- Compute the gradient $\nabla f(x)$ step-by-step.
- Set the gradient to zero to derive the Normal Equations.
- Show that if $\mathcal{N}(A) = \{0\}$, the Hessian is positive definite, ensuring a unique global minimum.
Solution
- $f(x) = \frac{1}{2}(Ax - b)^\top (Ax - b) = \frac{1}{2}(x^\top A^\top - b^\top)(Ax - b) = \frac{1}{2} x^\top A^\top A x - \frac{1}{2} x^\top A^\top b - \frac{1}{2} b^\top A x + \frac{1}{2} b^\top b$. Since a scalar equals its own transpose ($x^\top A^\top b = b^\top A x$), this simplifies to: $$ f(x) = \frac{1}{2} x^\top A^\top A x - b^\top A x + \frac{1}{2} b^\top b $$
- Gradient Derivation:
Term 1: $\frac{1}{2} x^\top (A^\top A) x$. Let $Q = A^\top A$ (symmetric). $$ \nabla (\frac{1}{2} x^\top Q x) = \frac{1}{2} (Q + Q^\top) x = \frac{1}{2} (2Q) x = Qx = A^\top A x $$
Term 2: $-b^\top A x$. This is linear in $x$. We can rewrite it as $-(A^\top b)^\top x$. The gradient of $c^\top x$ is $c$. Here $c = -(A^\top b)$. $$ \nabla (-b^\top A x) = -A^\top b $$
Term 3: $\frac{1}{2} b^\top b$. Constant w.r.t $x$, gradient is 0.
Result: $\nabla f(x) = A^\top A x - A^\top b = A^\top (Ax - b)$.
- Setting $\nabla f(x) = 0$ gives $A^\top A x = A^\top b$. These are the Normal Equations.
- $\nabla^2 f(x) = A^\top A$. If $\mathcal{N}(A) = \{0\}$, then for any $v \neq 0$, $Av \neq 0$. $$ v^\top A^\top A v = (Av)^\top (Av) = \|Av\|_2^2 > 0 $$ Thus the Hessian is positive definite, which guarantees strict convexity and a unique global minimum.
Recap
Least Squares Objective: $f(x) = \frac{1}{2}\|Ax - b\|_2^2$ is a convex quadratic function.
Gradients:
- $\nabla (\frac{1}{2} x^\top Q x) = Qx$ (for symmetric $Q$).
- $\nabla (c^\top x) = c$.
Normal Equations: $A^\top A x = A^\top b$. This system represents the condition $A^\top (Ax - b) = 0$, meaning the residual is orthogonal to the columns of $A$.
Convexity & Uniqueness:
- The Hessian $\nabla^2 f(x) = A^\top A$ is always PSD ($\succeq 0$), ensuring convexity.
- If $\mathcal{N}(A) = \{0\}$ (Full Column Rank), $A^\top A$ is PD ($\succ 0$), ensuring strict convexity and a unique global minimum.
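The normal equations and the orthogonality of the residual can be confirmed numerically, and cross-checked against NumPy's built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((8, 3))          # generically full column rank
b = rng.standard_normal(8)

# Solve the normal equations A^T A x = A^T b
x = np.linalg.solve(A.T @ A, A.T @ b)

# Residual is orthogonal to the columns of A
assert np.allclose(A.T @ (A @ x - b), 0)
# Agrees with the library least-squares solver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```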
P0.20 — General Quadratic Minimization
Solve the unconstrained minimization problem for a general quadratic function: $$ \min_{x \in \mathbb{R}^n} f(x) := \frac{1}{2} x^\top H x + g^\top x + c $$ where $H \in \mathbb{S}_{++}^n$ (Positive Definite).
Solution
By Lemma A2 (Quadratic Minimization), the minimizer is obtained by setting the gradient to zero:
$$ \nabla f(x) = Hx + g = 0 \implies x^\star = -H^{-1}g. $$
Substituting back into the objective gives the optimal value:
$$ f(x^\star) = \frac{1}{2} (-H^{-1}g)^\top H (-H^{-1}g) + g^\top (-H^{-1}g) + c = \frac{1}{2} g^\top H^{-1} g - g^\top H^{-1} g + c = c - \frac{1}{2} g^\top H^{-1} g. $$
This formula is the foundation for almost all dual derivations (where we minimize the Lagrangian with respect to $x$).
Recap
Quadratic Forms: Every convex quadratic problem reduces to solving a linear system ($Hx = -g$).
Completing the Square: The minimum value is $c - \frac{1}{2} g^\top H^{-1} g$: the constant term minus a quadratic form in the linear coefficient $g$, weighted by $H^{-1}$.
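The closed-form minimizer and optimal value can be verified on a random positive definite instance:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 4
M = rng.standard_normal((n, n))
H = M @ M.T + np.eye(n)                  # H > 0 by construction
g, c = rng.standard_normal(n), 1.5

f = lambda x: 0.5 * x @ H @ x + g @ x + c
x_star = -np.linalg.solve(H, g)          # minimizer: solve H x = -g

# Optimal value matches c - (1/2) g^T H^{-1} g
assert np.isclose(f(x_star), c - 0.5 * g @ np.linalg.solve(H, g))
# x_star beats nearby perturbations (sanity check that it is a minimum)
for _ in range(20):
    assert f(x_star) <= f(x_star + 0.1 * rng.standard_normal(n)) + 1e-12
```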
11. Readings & Resources
- Boyd & Vandenberghe, Convex Optimization:
- Appendix A: Mathematical Background (Norms, Analysis, Functions)
- Appendix C: Numerical Linear Algebra (Operations, Factorizations)
- Gilbert Strang, Introduction to Linear Algebra:
- Chapter 2: Vector Spaces
- Chapter 3: Orthogonality
- Chapter 6: Eigenvalues and Eigenvectors
- Golub & Van Loan, Matrix Computations:
- Chapter 2: Matrix Multiplication (for numerical aspects)
- Chapter 5: Orthogonalization and Least Squares